Abstract: This paper introduces Chain-of-Sight, a vision-language bridge module that accelerates the pre-training of Multimodal Large Language Models (MLLMs). Our approach employs a sequence of visual resamplers that capture visual details at various spatial scales. This architecture not only leverages global and local visual contexts effectively, but also facilitates the flexible extension of visual tokens through a compound token scaling strategy, allowing up to a 16x increase in the token count after pre-training. Consequently, Chain-of-Sight requires significantly fewer visual tokens in the pre-training phase than in the fine-tuning phase. This intentional reduction of visual tokens during pre-training notably accelerates the pre-training process, cutting wall-clock training time by ~73%. Empirical results on a series of vision-language benchmarks reveal that the pre-training acceleration through Chain-of-Sight is achieved without sacrificing performance, matching or surpassing the standard pipeline of utilizing all visual tokens throughout the entire training process. Further scaling up the number of visual tokens for pre-training leads to stronger performance, competitive with existing approaches on a series of benchmarks.

What problem does this paper attempt to address?

This paper addresses how to reduce the number of visual tokens used when pre-training multimodal large language models (MLLMs), so that pre-training runs faster without sacrificing downstream performance. Current MLLMs process a large number of visual tokens during pre-training, which makes pre-training slow and computationally expensive. This limits the experiments researchers can afford to run and adds to the environmental cost of training. The paper therefore proposes a method named "Chain-of-Sight". It uses fewer visual tokens in the pre-training stage and more in the fine-tuning stage, which accelerates pre-training while maintaining or improving model performance.

### Main Contributions

1. **Accelerating Pre-training**: Reducing the number of visual tokens in the pre-training stage cuts pre-training time by up to about 73%.
2. **Performance Maintenance**: Although fewer visual tokens are used in pre-training, increasing the number of visual tokens in the fine-tuning stage yields final performance comparable to or better than that of a model trained with all visual tokens throughout.
3. **Multi-scale Visual Resampler**: A multi-scale visual resampler generates visual tokens at multiple scales, so the number of visual tokens can be adjusted flexibly between the pre-training and fine-tuning stages.
4. **Compound Token Expansion Strategy**: A compound token expansion strategy combines resolution expansion and window expansion, allowing the number of visual tokens in the fine-tuning stage to grow by up to 16 times.
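
The compound strategy in contribution 4 can be illustrated with some token-count arithmetic. The sketch below is not the paper's implementation: the grid sizes, window sizes, and `queries_per_window` are hypothetical values chosen so the numbers are easy to follow, and the function names are made up for this example. It assumes each scale tiles the feature grid with non-overlapping square windows and emits a fixed number of tokens per window.

```python
def tokens_for_scale(grid_side: int, window_side: int, queries_per_window: int) -> int:
    """Tokens produced at one scale: a fixed number of queries per non-overlapping window."""
    if grid_side % window_side != 0:
        raise ValueError("window_side must divide grid_side")
    windows_per_side = grid_side // window_side
    return windows_per_side ** 2 * queries_per_window


def total_tokens(grid_side: int, window_sides: list[int], queries_per_window: int) -> int:
    """Total visual tokens summed over all scales (one scale per window size)."""
    return sum(tokens_for_scale(grid_side, w, queries_per_window) for w in window_sides)


# Hypothetical pre-training setup: 16x16 feature grid, window sides 16 (global) and 8.
pretrain = total_tokens(16, [16, 8], queries_per_window=16)

# Resolution expansion (assumed): doubling the grid side quadruples the window count.
resolution_only = total_tokens(32, [16, 8], queries_per_window=16)

# Window expansion on top (assumed): halving each window side quadruples it again.
compound = total_tokens(32, [8, 4], queries_per_window=16)

print(pretrain, resolution_only, compound)  # 80 320 1280
print(compound // pretrain)                 # 16
```

Under these assumptions each of the two expansions contributes a factor of 4, and their product gives the 16x figure. The split between the two factors in the actual method may differ.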
### Method Overview

- **Multi-scale Visual Resampler**: Visual features are divided into windows of different sizes, and visual tokens are generated within each window, producing multi-scale visual tokens.
- **Compound Token Expansion Strategy**: Fewer visual tokens are used in the pre-training stage, and the number of visual tokens is increased in the fine-tuning stage through resolution expansion and window expansion.
- **Coarse-to-Fine Integration**: The multi-scale visual tokens are fed to the language model in coarse-to-fine order to improve the model's visual understanding.

### Experimental Results

- **Pre-training Acceleration**: Chain-of-Sight significantly reduces pre-training time without sacrificing performance.
- **Downstream Task Performance**: On multiple vision-language benchmarks, the Chain-of-Sight model performs comparably to or better than a model that uses all visual tokens, especially on image captioning, visual question answering, and text recognition tasks.

In conclusion, Chain-of-Sight reduces the time and resources consumed when pre-training multimodal large language models, and offers a direction for future multimodal model research.
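
The window partitioning and coarse-to-fine ordering described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: mean pooling stands in for the learned resampler, whose internals the summary does not specify, and `window_tokens` and `coarse_to_fine_tokens` are hypothetical names, not functions from the paper's code.

```python
import numpy as np


def window_tokens(features: np.ndarray, window_side: int) -> np.ndarray:
    """Split an (H, W, C) feature map into non-overlapping square windows and
    emit one token per window. Mean pooling is a stand-in for a learned resampler."""
    h, w, c = features.shape
    if h % window_side or w % window_side:
        raise ValueError("window_side must divide both H and W")
    rows, cols = h // window_side, w // window_side
    # Shape (rows, window_side, cols, window_side, C): axes 1 and 3 index pixels inside a window.
    windows = features.reshape(rows, window_side, cols, window_side, c)
    return windows.mean(axis=(1, 3)).reshape(rows * cols, c)


def coarse_to_fine_tokens(features: np.ndarray, window_sides: list[int]) -> np.ndarray:
    """Concatenate per-scale tokens, largest window (coarsest scale) first."""
    ordered = sorted(window_sides, reverse=True)
    return np.concatenate([window_tokens(features, s) for s in ordered], axis=0)


# Toy 8x8 feature map with 4 channels; window sides 8, 4, 2 give 1 + 4 + 16 tokens.
features = np.arange(8 * 8 * 4, dtype=np.float32).reshape(8, 8, 4)
tokens = coarse_to_fine_tokens(features, [2, 8, 4])
print(tokens.shape)  # (21, 4)
```

The first row of `tokens` summarizes the whole image, and later rows cover progressively smaller regions, which is the ordering in which the tokens would be passed to the language model.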

Accelerating Pre-training of Multimodal LLMs via Chain-of-Sight
