Molmo and PixMo

Oct 18, 2024 · 8m 8s

Description

🔓 Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models

This research paper introduces Molmo, a new family of vision-language models (VLMs) that surpasses existing open-weight models in performance while maintaining open weights, data, and code. The key innovation is the collection of a large, detailed image caption dataset using speech-based descriptions, avoiding reliance on synthetic data generated by proprietary VLMs. Molmo is trained on this dataset, along with a diverse mixture of fine-tuning datasets, to achieve state-of-the-art performance on multiple academic benchmarks and human evaluation, even compared to proprietary systems like GPT-4o. The paper emphasizes the importance of open research and provides a comprehensive overview of the model architecture, data collection methods, training process, and evaluation results.

📎 Link to paper
🟣 Try their demo

show less