Accelerating Pre-training of Multimodal LLMs via Chain-of-Sight

Ziyuan Huang, Kaixiang Ji, Biao Gong, Zhiwu Qing, Qinglong Zhang, Kecheng Zheng, Jian Wang, Jingdong Chen, Ming Yang
2024-07-23
Abstract: This paper introduces Chain-of-Sight, a vision-language bridge module that accelerates the pre-training of Multimodal Large Language Models (MLLMs). Our approach employs a sequence of visual resamplers that capture visual details at various spatial scales. This architecture not only leverages global and local visual contexts effectively, but also facilitates the flexible extension of visual tokens through a compound token scaling strategy, allowing up to a 16x increase in the token count post pre-training. Consequently, Chain-of-Sight requires significantly fewer visual tokens in the pre-training phase compared to the fine-tuning phase. This intentional reduction of visual tokens during pre-training notably accelerates the pre-training process, cutting down the wall-clock training time by ~73%. Empirical results on a series of vision-language benchmarks reveal that the pre-train acceleration through Chain-of-Sight is achieved without sacrificing performance, matching or surpassing the standard pipeline of utilizing all visual tokens throughout the entire training process. Further scaling up the number of visual tokens for pre-training leads to stronger performance, competitive with existing approaches in a series of benchmarks.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper asks how to reduce the number of visual tokens used while pre-training multimodal large language models (MLLMs), so that pre-training runs faster without sacrificing performance on downstream tasks. Current MLLMs process a large number of visual tokens during pre-training, which makes pre-training slow and computationally expensive. These costs limit how many experiments researchers can run and add to the environmental burden. The paper therefore proposes a method named "Chain-of-Sight", which uses fewer visual tokens during pre-training and more during fine-tuning, accelerating pre-training while maintaining or improving model performance.

### Main Contributions

1. **Accelerating Pre-training**: Using fewer visual tokens in the pre-training stage reduces wall-clock pre-training time by about 73%.
2. **Performance Maintenance**: Although fewer visual tokens are used in pre-training, increasing the number of visual tokens in the fine-tuning stage yields final performance comparable to or better than that of a model trained with all visual tokens throughout.
3. **Multi-scale Visual Resampler**: A multi-scale visual resampler generates visual tokens at multiple scales, so the number of visual tokens can be adjusted flexibly between the pre-training and fine-tuning stages.
4. **Compound Token Expansion Strategy**: A compound token expansion strategy combines resolution expansion and window expansion, increasing the number of visual tokens in the fine-tuning stage by up to 16 times.
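
The 16x figure above can be read as the product of two factors. The sketch below is a back-of-the-envelope illustration, not the paper's configuration: the function `count_visual_tokens`, the grid sides, the window sides, and the one-token-per-window setting are all hypothetical. It assumes that doubling the input resolution doubles the side of the visual feature grid, and that windows are square and non-overlapping.

```python
def count_visual_tokens(grid_side, window_sides, tokens_per_window):
    """Count tokens produced over a square feature grid (hypothetical helper).

    Each scale splits the grid into non-overlapping square windows of side
    `w` and emits `tokens_per_window` tokens per window.
    """
    total = 0
    for w in window_sides:
        if grid_side % w != 0:
            raise ValueError(f"window side {w} does not divide grid side {grid_side}")
        num_windows = (grid_side // w) ** 2
        total += num_windows * tokens_per_window
    return total


# Assumed pre-training setup: 16x16 feature grid, two scales, 1 token per window.
pretrain = count_visual_tokens(16, [16, 8], 1)

# Resolution expansion only: the grid side doubles, window sides stay fixed.
resolution_scaled = count_visual_tokens(32, [16, 8], 1)

# Compound expansion: the grid side doubles AND the window sides are halved.
compound_scaled = count_visual_tokens(32, [8, 4], 1)

print(pretrain, resolution_scaled, compound_scaled)
print("resolution factor:", resolution_scaled // pretrain)
print("compound factor:", compound_scaled // pretrain)
```

Under these assumptions, resolution expansion alone multiplies the token count by 4, and combining it with window expansion multiplies it by 16, which is consistent with the upper bound stated in the summary.
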
### Method Overview

- **Multi-scale Visual Resampler**: Visual features are divided into windows of different sizes, and visual tokens are generated within each window, producing tokens at multiple scales.
- **Compound Token Expansion Strategy**: Fewer visual tokens are used in the pre-training stage, and the number of visual tokens is increased in the fine-tuning stage through resolution expansion and window expansion.
- **Coarse-to-Fine Integration**: The multi-scale visual tokens are fed to the language model in coarse-to-fine order to improve the model's visual understanding.

### Experimental Results

- **Pre-training Acceleration**: The experiments show that Chain-of-Sight significantly reduces pre-training time without sacrificing performance.
- **Downstream Task Performance**: On multiple vision-language benchmarks, the Chain-of-Sight model performs comparably to or better than a model trained with all visual tokens, especially on image captioning, visual question answering, and text recognition tasks.

In conclusion, Chain-of-Sight addresses the excessive time and resource consumption of pre-training multimodal large language models, and offers a new approach for future multimodal model research.
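
The windowing and coarse-to-fine ordering described in the method overview can be sketched as follows. This is a simplified stand-in, not the paper's resampler: mean pooling replaces whatever learned mechanism generates tokens inside each window, and the names `window_mean_tokens` and `coarse_to_fine_tokens`, along with the toy 4x4 grid, are hypothetical.

```python
def window_mean_tokens(grid, window):
    """Return one mean-pooled token per non-overlapping window (row-major).

    `grid` is a square list-of-lists of feature vectors (lists of floats).
    Mean pooling is an assumption used here purely for illustration.
    """
    side = len(grid)
    if side % window != 0:
        raise ValueError(f"window {window} does not divide grid side {side}")
    dim = len(grid[0][0])
    tokens = []
    for top in range(0, side, window):
        for left in range(0, side, window):
            acc = [0.0] * dim
            for r in range(top, top + window):
                for c in range(left, left + window):
                    for d in range(dim):
                        acc[d] += grid[r][c][d]
            tokens.append([v / (window * window) for v in acc])
    return tokens


def coarse_to_fine_tokens(grid, windows):
    """Concatenate tokens from the largest window (coarsest) to the smallest."""
    sequence = []
    for w in sorted(windows, reverse=True):
        sequence.extend(window_mean_tokens(grid, w))
    return sequence


# Toy 4x4 grid of 1-dimensional features; cell (r, c) holds the value 4*r + c.
grid = [[[float(4 * r + c)] for c in range(4)] for r in range(4)]
tokens = coarse_to_fine_tokens(grid, [2, 4])
print(len(tokens), tokens)
```

The first token summarizes the whole grid (global context) and the next four summarize the 2x2 quadrants (local context), so the sequence handed to the language model runs from coarse to fine.
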