Abstract: The rapid advancement of Large Language Models (LLMs) has led to an influx of efforts to extend their capabilities to multimodal tasks. Among them, growing attention has been focused on monolithic Multimodal Large Language Models (MLLMs) that integrate visual encoding and language decoding into a single LLM. Despite the structural simplicity and deployment-friendliness, training a monolithic MLLM with promising performance still remains challenging. In particular, the popular approaches adopt continuous pre-training to extend a pre-trained LLM to a monolithic MLLM, which suffers from catastrophic forgetting and leads to performance degeneration. In this paper, we aim to overcome this limitation from the perspective of delta tuning. Specifically, our core idea is to embed visual parameters into a pre-trained LLM, thereby incrementally learning visual knowledge from massive data via delta tuning, i.e., freezing the LLM when optimizing the visual parameters. Based on this principle, we present Mono-InternVL, a novel monolithic MLLM that seamlessly integrates a set of visual experts via a multimodal mixture-of-experts structure. Moreover, we propose an innovative pre-training strategy to maximize the visual capability of Mono-InternVL, namely Endogenous Visual Pre-training (EViP). In particular, EViP is designed as a progressive learning process for visual experts, which aims to fully exploit the visual knowledge from noisy data to high-quality data. To validate our approach, we conduct extensive experiments on 16 benchmarks. Experimental results not only validate the superior performance of Mono-InternVL compared to the state-of-the-art MLLM on 6 multimodal benchmarks, e.g., +113 points over InternVL-1.5 on OCRBench, but also confirm its better deployment efficiency, with first token latency reduced by up to 67%.
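
The delta-tuning idea in the abstract (freeze the LLM, optimize only the visual parameters) can be illustrated with a toy update step. This is a minimal sketch, not the authors' implementation: the `Param` class, the `delta_tuning_step` function, the parameter names, and the plain SGD rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Param:
    """A scalar parameter with a flag marking whether it may be updated."""
    value: float
    trainable: bool


def delta_tuning_step(params, grads, lr=0.1):
    """Apply one SGD step, skipping every frozen parameter."""
    for name, p in params.items():
        if p.trainable:
            p.value -= lr * grads[name]
    return params


# Hypothetical parameter names: "llm.*" stands in for pre-trained LLM weights,
# "visual.*" for the newly embedded visual parameters.
params = {
    "llm.ffn.w": Param(value=1.0, trainable=False),    # frozen
    "visual.ffn.w": Param(value=1.0, trainable=True),  # optimized
}
grads = {"llm.ffn.w": 0.5, "visual.ffn.w": 0.5}

delta_tuning_step(params, grads)
print(params["llm.ffn.w"].value)     # unchanged by the step
print(params["visual.ffn.w"].value)  # moved against the gradient
```

Both parameters receive the same gradient, but only the visual one changes, which is why the pre-trained language knowledge is left intact.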

What problem does this paper attempt to address?

The paper addresses catastrophic forgetting during continuous pre-training of monolithic Multimodal Large Language Models (MLLMs). Existing monolithic MLLMs extend a pre-trained language model through continuous pre-training, and the resulting forgetting degrades performance. The paper introduces a separate set of visual parameters and tunes only those during visual pre-training, which retains the pre-trained language knowledge while the model learns visual capabilities.

### Specific Issues:

1. **Catastrophic Forgetting**: During continuous pre-training, optimizing for visual tasks can impair existing language capabilities.
2. **Performance Enhancement**: How to significantly improve performance on visual tasks while maintaining language capabilities.
3. **Deployment Efficiency**: How to design a simple and efficient monolithic MLLM for practical deployment.

### Solutions:

- **Delta Tuning**: Freezing the pre-trained language model parameters and optimizing only the newly added visual parameters avoids catastrophic forgetting.
- **Visual Expert Integration**: A set of visual experts is embedded within the pre-trained language model, using a Multimodal Mixture-of-Experts (MoE) structure to handle both visual and textual information.
- **Endogenous Visual Pre-training (EViP)**: A phased pre-training strategy that progresses from basic visual concept learning to advanced semantic understanding and task alignment.

### Main Contributions:

1. **Novel Monolithic Architecture**: Proposes Mono-InternVL, which seamlessly integrates visual experts through a multimodal mixture-of-experts structure, extending the pre-trained language model while retaining its pre-trained knowledge.
2. **Innovative Pre-training Method**: Introduces Endogenous Visual Pre-training (EViP), a phased learning strategy in which the visual experts acquire visual knowledge progressively, from noisy data to high-quality data.
3. **Leading Performance**: Mono-InternVL achieves significant performance improvements on multiple multimodal benchmarks, particularly excelling in mathematical reasoning and text recognition tasks, while also outperforming existing models in deployment efficiency.

These contributions address key issues in monolithic MLLMs and suggest directions for future designs.
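
The visual-expert integration described in this summary can be sketched as a layer that sends each token to one of two experts. The summary does not state the routing rule, so this sketch assumes tokens are routed by a fixed modality flag. The functions `text_expert`, `visual_expert`, and `multimodal_moe` are hypothetical stand-ins, and the arithmetic inside each expert is a placeholder for a real feed-forward layer.

```python
def text_expert(x):
    """Stand-in for the frozen, pre-trained LLM feed-forward layer."""
    return [2.0 * v for v in x]


def visual_expert(x):
    """Stand-in for a newly added, trainable visual feed-forward layer."""
    return [v + 1.0 for v in x]


def multimodal_moe(tokens, is_visual):
    """Route each token to one expert based on its modality flag (assumed rule)."""
    if len(tokens) != len(is_visual):
        raise ValueError("tokens and is_visual must have the same length")
    return [
        visual_expert(tok) if vis else text_expert(tok)
        for tok, vis in zip(tokens, is_visual)
    ]


# Two image-patch tokens followed by one text token, each a 2-d toy embedding.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
is_visual = [True, True, False]
outputs = multimodal_moe(tokens, is_visual)
print(outputs)
```

Under this assumption, text tokens only ever pass through the original LLM weights, so keeping those weights frozen while the visual expert trains fits the delta-tuning principle.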

Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
