Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, Lixin Gu, Xuehui Wang, Qingyun Li, Yimin Ren, Zixuan Chen, Jiapeng Luo, Jiahao Wang, Tan Jiang, Bo Wang, Conghui He, Botian Shi, Xingcheng Zhang, Han Lv, Yi Wang, Wenqi Shao, Pei Chu, Zhongying Tu, Tong He, Zhiyong Wu, Huipeng Deng, Jiaye Ge, Kai Chen, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu, Dahua Lin, Yu Qiao, Jifeng Dai, Wenhai Wang
2024-12-07
Abstract: We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancements in training and testing strategies as well as data quality. In this work, we delve into the relationship between model scaling and performance, systematically exploring the performance trends in vision encoders, language models, dataset sizes, and test-time configurations. Through extensive evaluations on a wide range of benchmarks, including multi-discipline reasoning, document understanding, multi-image / video understanding, real-world comprehension, multimodal hallucination detection, visual grounding, multilingual capabilities, and pure language processing, InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet. Notably, our model is the first open-source MLLM to surpass 70% on the MMMU benchmark, achieving a 3.7-point improvement through Chain-of-Thought (CoT) reasoning and showcasing strong potential for test-time scaling. We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems. A HuggingFace demo is available at https://huggingface.co/spaces/OpenGVLab/InternVL
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problems that this paper attempts to solve are:

1. **Narrow the performance gap between open-source multimodal large language models (MLLMs) and commercial closed-source models**: Although existing open-source multimodal models such as the InternVL series and the Qwen-VL series provide high-performance and transparent alternatives, they still trail commercial closed-source models such as GPT-4o and Claude-3.5-Sonnet in performance and efficiency. By introducing InternVL 2.5, the paper aims to improve open-source multimodal models by systematically exploring model scaling, data quality, and test-time strategies.
2. **Study how scaling different components of a multimodal model affects performance**: The paper explores how the vision encoder, the language model, the dataset size, and the test-time configuration each affect overall performance. Specific findings include:
   - **Large-scale vision encoders significantly reduce the dependence on training data**: For example, InternVL 2.5 uses a 6B vision encoder and achieves better performance than Qwen2-VL-72B (equipped with a 600M vision encoder) with only 1/10 of the training tokens.
   - **Data quality matters**: From InternVL 2.0 to 2.5 the dataset size doubled, but strict filtering also greatly improved data quality, with the largest benefits in Chain-of-Thought (CoT) reasoning tasks and complex challenges such as OlympiadBench.
   - **Test-time scaling helps on difficult multimodal question answering**: On challenging tasks such as MMMU, InternVL 2.5 reaches 70.1% accuracy with CoT reasoning, 3.7 percentage points higher than with direct responses.
3. **Provide powerful open-source tools to promote the development of multimodal AI systems**: By releasing InternVL 2.5, the paper hopes to contribute a powerful tool to the open-source community and encourage further research and applications. InternVL 2.5 performs well across multiple benchmarks, notably becoming the first open-source MLLM to exceed 70% accuracy on the MMMU validation set, which demonstrates the potential of open-source solutions in advancing multimodal AI.

Together, these goals aim both to improve the performance of multimodal models and to provide valuable resources and technical support for the open-source community.
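The figures quoted in the findings above can be cross-checked with simple arithmetic. The sketch below is only an illustration of that arithmetic, not anything from the paper's code: the function names are hypothetical, and the inputs are the numbers stated in this summary (70.1% with CoT, a 3.7-point gain, a 6B versus a 600M vision encoder).

```python
# Illustrative arithmetic only; function names are hypothetical and the
# inputs are the figures quoted in the summary above.

def implied_direct_accuracy(cot_accuracy: float, gain_points: float) -> float:
    """Accuracy of direct answering implied by the CoT accuracy and its gain."""
    return round(cot_accuracy - gain_points, 1)


def relative_gain_percent(cot_accuracy: float, gain_points: float) -> float:
    """Gain expressed relative to the implied direct-answer accuracy."""
    direct = implied_direct_accuracy(cot_accuracy, gain_points)
    return gain_points / direct * 100


def encoder_size_ratio(large_params: int, small_params: int) -> float:
    """How many times larger one vision encoder is than another."""
    return large_params / small_params


direct_acc = implied_direct_accuracy(70.1, 3.7)
relative_gain = relative_gain_percent(70.1, 3.7)
ratio = encoder_size_ratio(6_000_000_000, 600_000_000)

print(f"Implied direct-response accuracy on MMMU: {direct_acc}%")
print(f"Relative gain from CoT: {relative_gain:.2f}%")
print(f"Vision encoder size ratio (6B vs 600M): {ratio:.0f}x")
```

Under these inputs, the implied direct-response accuracy is 66.4%, and the 6B encoder is ten times the size of the 600M one. Note that a 3.7-point absolute gain is a somewhat larger relative gain, since it is measured against the lower direct-response baseline.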