What matters when building vision-language models?

Hugo Laurençon, Léo Tronchon, Matthieu Cord, Victor Sanh
DOI: https://doi.org/10.48550/arXiv.2405.02246
2024-05-04
Abstract: The growing interest in vision-language models (VLMs) has been driven by improvements in large language models and vision transformers. Despite the abundance of literature on this subject, we observe that critical decisions regarding the design of VLMs are often not justified. We argue that these unsupported decisions impede progress in the field by making it difficult to identify which choices improve model performance. To address this issue, we conduct extensive experiments around pre-trained models, architecture choice, data, and training methods. Our consolidation of findings includes the development of Idefics2, an efficient foundational VLM of 8 billion parameters. Idefics2 achieves state-of-the-art performance within its size category across various multimodal benchmarks, and is often on par with models four times its size. We release the model (base, instructed, and chat) along with the datasets created for its training.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper examines key design decisions in building vision-language models (VLMs): the choice of pre-trained models, architecture, data, and training methods. The authors observe that these decisions are often made without justification, which makes it difficult to identify which choices actually improve model performance and so hinders progress in the field. Through extensive experiments, the paper confirms the benefit of building on pre-trained backbones and of a fully autoregressive architecture, and it introduces Idefics2, an efficient foundational VLM of 8 billion parameters that achieves state-of-the-art performance within its size category across multiple multimodal benchmarks. The study also highlights the importance of the quality of the pre-trained backbones, training stability, and efficiency improvements.