Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training

Gen Luo, Xue Yang, Wenhan Dou, Zhaokai Wang, Jifeng Dai, Yu Qiao, Xizhou Zhu
2024-10-11
Abstract: The rapid advancement of Large Language Models (LLMs) has led to an influx of efforts to extend their capabilities to multimodal tasks. Among them, growing attention has been focused on monolithic Multimodal Large Language Models (MLLMs) that integrate visual encoding and language decoding into a single LLM. Despite the structural simplicity and deployment-friendliness, training a monolithic MLLM with promising performance still remains challenging. In particular, the popular approaches adopt continuous pre-training to extend a pre-trained LLM to a monolithic MLLM, which suffers from catastrophic forgetting and leads to performance degeneration. In this paper, we aim to overcome this limitation from the perspective of delta tuning. Specifically, our core idea is to embed visual parameters into a pre-trained LLM, thereby incrementally learning visual knowledge from massive data via delta tuning, i.e., freezing the LLM when optimizing the visual parameters. Based on this principle, we present Mono-InternVL, a novel monolithic MLLM that seamlessly integrates a set of visual experts via a multimodal mixture-of-experts structure. Moreover, we propose an innovative pre-training strategy to maximize the visual capability of Mono-InternVL, namely Endogenous Visual Pre-training (EViP). In particular, EViP is designed as a progressive learning process for visual experts, which aims to fully exploit the visual knowledge from noisy data to high-quality data. To validate our approach, we conduct extensive experiments on 16 benchmarks. Experimental results not only validate the superior performance of Mono-InternVL compared to the state-of-the-art MLLM on 6 multimodal benchmarks, e.g., +113 points over InternVL-1.5 on OCRBench, but also confirm its better deployment efficiency, with first token latency reduced by up to 67%.
Computer Vision and Pattern Recognition, Computation and Language
What problem does this paper attempt to address?
The paper addresses catastrophic forgetting during continuous pre-training of monolithic Multimodal Large Language Models (MLLMs), with the goal of adding visual capability without degrading language capability. Existing monolithic MLLMs extend a pre-trained language model through continuous pre-training and often lose performance to catastrophic forgetting. The paper instead introduces an independent set of visual parameters and tunes only those parameters during visual pre-training, so that the pre-trained language knowledge is retained while visual capability is learned.

### Specific Issues:

1. **Catastrophic Forgetting**: During continuous pre-training, optimizing for visual tasks can impair existing language capabilities.
2. **Performance Enhancement**: How to significantly improve performance on visual tasks while maintaining language capabilities.
3. **Deployment Efficiency**: How to design a simple and efficient monolithic MLLM for practical deployment.

### Solutions:

- **Delta Tuning**: The pre-trained language model parameters are frozen and only the newly added visual parameters are optimized, which avoids catastrophic forgetting.
- **Visual Expert Integration**: A set of visual experts is embedded within the pre-trained language model, using a Multimodal Mixture-of-Experts (MoE) structure to handle both visual and textual information.
- **Endogenous Visual Pre-training (EViP)**: A phased pre-training strategy that progresses from basic visual concept learning to advanced semantic understanding and task alignment.

### Main Contributions:

1. **Novel Monolithic Architecture**: Proposes Mono-InternVL, which integrates visual experts through a multimodal mixture-of-experts structure, extending the pre-trained language model while retaining its pre-trained knowledge.
2. **Innovative Pre-training Method**: Introduces Endogenous Visual Pre-training (EViP), a phased learning strategy in which the visual experts acquire visual knowledge progressively, from noisy data to high-quality data.
3. **Leading Performance**: Mono-InternVL improves on the state-of-the-art MLLM on 6 multimodal benchmarks (for example, +113 points over InternVL-1.5 on OCRBench), with particular strength in mathematical reasoning and text recognition tasks. It is also more efficient to deploy, with first-token latency reduced by up to 67%.

These contributions address key issues in monolithic MLLMs and suggest new directions for future designs.
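
The delta-tuning and visual-expert ideas summarized above can be illustrated with a toy sketch. It is not the authors' implementation. It assumes hard routing by modality (visual tokens go to a visual expert, text tokens to the original expert), which this summary does not spell out, and it replaces each feed-forward expert with a tiny linear map trained by plain SGD. The names `LinearExpert` and `ModalityRoutedLayer` are hypothetical.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


def identity(dim: int) -> List[Vector]:
    """Identity matrix, used here as an arbitrary toy initialisation."""
    return [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]


@dataclass
class LinearExpert:
    """Toy stand-in for an FFN expert: y = W x (assumption, not the real FFN)."""
    weight: List[Vector]
    trainable: bool = True

    def forward(self, x: Vector) -> Vector:
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.weight]

    def sgd_step(self, x: Vector, target: Vector, lr: float) -> None:
        """One SGD step on 0.5 * ||W x - target||^2; a frozen expert is skipped."""
        if not self.trainable:
            return
        err = [y - t for y, t in zip(self.forward(x), target)]
        for i, e in enumerate(err):
            for j, xj in enumerate(x):
                self.weight[i][j] -= lr * e * xj


@dataclass
class ModalityRoutedLayer:
    """Routes each token to one expert based on its modality flag (assumed rule)."""
    text_expert: LinearExpert
    visual_expert: LinearExpert

    def route(self, is_visual: bool) -> LinearExpert:
        return self.visual_expert if is_visual else self.text_expert

    def forward(self, tokens: List[Vector], is_visual: List[bool]) -> List[Vector]:
        return [self.route(v).forward(x) for x, v in zip(tokens, is_visual)]

    def train_step(self, tokens, is_visual, targets, lr: float = 0.1) -> None:
        for x, v, t in zip(tokens, is_visual, targets):
            self.route(v).sgd_step(x, t, lr)


# Delta tuning: the "pre-trained" text expert is frozen, the visual expert is trainable.
layer = ModalityRoutedLayer(
    text_expert=LinearExpert(identity(2), trainable=False),
    visual_expert=LinearExpert(identity(2), trainable=True),
)

tokens = [[1.0, 0.0], [0.0, 1.0]]
is_visual = [True, False]            # first token is visual, second is text
targets = [[2.0, 0.0], [0.0, 2.0]]   # made-up regression targets

text_before = [row[:] for row in layer.text_expert.weight]
for _ in range(50):
    layer.train_step(tokens, is_visual, targets)

visual_out = layer.forward(tokens, is_visual)[0]
print("text expert unchanged:", layer.text_expert.weight == text_before)
print("visual output moved toward its target:", visual_out)
```

In this toy, the text expert's weights are identical before and after training even though a text token with a non-zero loss appears in every batch, which is the property delta tuning relies on to avoid forgetting. Only the visual expert adapts.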