Abstract: We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancements in training and testing strategies as well as data quality. In this work, we delve into the relationship between model scaling and performance, systematically exploring the performance trends in vision encoders, language models, dataset sizes, and test-time configurations. Through extensive evaluations on a wide range of benchmarks, including multi-discipline reasoning, document understanding, multi-image / video understanding, real-world comprehension, multimodal hallucination detection, visual grounding, multilingual capabilities, and pure language processing, InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet. Notably, our model is the first open-source MLLM to surpass 70% on the MMMU benchmark, achieving a 3.7-point improvement through Chain-of-Thought (CoT) reasoning and showcasing strong potential for test-time scaling. We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems. For the HuggingFace demo, see <a class="link-external link-https" href="https://huggingface.co/spaces/OpenGVLab/InternVL" rel="external noopener nofollow">this https URL</a>.

What problem does this paper attempt to address?

The paper attempts to address three problems:

1. **Narrowing the performance gap between open-source multimodal large language models (MLLMs) and commercial closed-source models**: Although existing open-source multimodal models such as the InternVL series and the Qwen-VL series provide high-performance, transparent alternatives, they still trail commercial closed-source models such as GPT-4o and Claude-3.5-Sonnet in performance and efficiency. With InternVL 2.5, the paper aims to improve open-source multimodal models by systematically exploring model scaling, data quality, and test-time strategies.
2. **Studying how scaling different components of a multimodal model affects performance**: The paper explores how vision encoders, language models, dataset size, and test-time configurations affect the overall performance of multimodal models. Specific findings include:
   - **Large-scale vision encoders significantly reduce the dependence on training data**: For example, InternVL 2.5 uses a 6B vision encoder and achieves better performance than Qwen2-VL-72B (equipped with a 600M vision encoder) with only 1/10 of the training tokens.
   - **Data quality matters**: From InternVL 2.0 to 2.5, the dataset size doubled while strict filtering greatly improved data quality, with gains especially on Chain-of-Thought (CoT) reasoning tasks and complex challenges such as OlympiadBench.
   - **Test-time scaling benefits difficult multimodal question answering**: On challenging tasks such as MMMU, InternVL 2.5 reaches 70.1% accuracy with CoT reasoning, 3.7 percentage points higher than with direct responses.
3. **Providing a powerful open-source tool to promote the development of multimodal AI systems**: By releasing InternVL 2.5, the paper hopes to contribute a powerful tool to the open-source community and encourage further research and applications. InternVL 2.5 performs well on multiple benchmarks and is the first open-source MLLM to exceed 70% accuracy on the MMMU validation set, demonstrating the potential of open-source solutions in advancing multimodal AI.

Together, these goals aim both to improve the performance of multimodal models and to provide resources and technical support for the open-source community.
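The figures quoted above imply a few derived quantities that the summary does not state. The following is a minimal sketch of that arithmetic in Python. The variable names are illustrative, and the direct-response baseline and relative gain are computed here from the quoted numbers, not taken from the paper.

```python
# Figures quoted in the summary above.
cot_accuracy = 70.1        # MMMU accuracy with CoT reasoning (%)
cot_gain = 3.7             # gain over direct response (percentage points)
internvl_encoder = 6e9     # InternVL 2.5 vision encoder size (parameters)
qwen_encoder = 600e6       # Qwen2-VL-72B vision encoder size (parameters)

# Implied direct-response accuracy (derived, not quoted).
direct_accuracy = round(cot_accuracy - cot_gain, 1)

# Relative improvement of CoT over the implied direct baseline.
relative_gain = cot_gain / direct_accuracy

# How many times larger the InternVL 2.5 vision encoder is.
encoder_ratio = internvl_encoder / qwen_encoder

print(f"direct-response accuracy: {direct_accuracy}%")
print(f"relative CoT gain: {relative_gain:.1%}")
print(f"vision encoder size ratio: {encoder_ratio:.0f}x")
```

Read this way, the summary describes a vision encoder about ten times larger paired with about one tenth of the training tokens.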

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
