Croc: Pretraining Large Multimodal Models with Cross-Modal Comprehension

Yin Xie, Kaicheng Yang, Ninghua Yang, Weimo Deng, Xiangzi Dai, Tiancheng Gu, Yumeng Wang, Xiang An, Yongle Zhao, Ziyong Feng, Jiankang Deng
2024-10-18
Abstract: Recent advances in Large Language Models (LLMs) have catalyzed the development of Large Multimodal Models (LMMs). However, existing research primarily focuses on tuning language and image instructions, ignoring the critical pretraining phase where models learn to process textual and visual modalities jointly. In this paper, we propose a new pretraining paradigm for LMMs to enhance the visual comprehension capabilities of LLMs by introducing a novel cross-modal comprehension stage. Specifically, we design a dynamically learnable prompt token pool and employ the Hungarian algorithm to replace part of the original visual tokens with the most relevant prompt tokens. Then, we conceptualize visual tokens as analogous to a "foreign language" for the LLMs and propose a mixed attention mechanism with bidirectional visual attention and unidirectional textual attention to comprehensively enhance the understanding of visual tokens. Meanwhile, we integrate a detailed caption generation task, leveraging rich descriptions to further facilitate LLMs in understanding visual semantic information. After pretraining on 1.5 million publicly accessible data, we present a new foundation model called Croc. Experimental results demonstrate that Croc achieves new state-of-the-art performance on massive vision-language benchmarks. To support reproducibility and facilitate further research, we release the training code and pre-trained model weights at https://github.com/deepglint/Croc.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses a gap in how existing large multimodal models (LMMs) are trained: most research concentrates on tuning language and image instructions and neglects the pretraining stage, which is where the model learns to process text and visual information jointly. To close this gap, the authors propose a new pretraining paradigm that strengthens the visual comprehension of large language models (LLMs) by adding a cross-modal comprehension stage. It has three components:

1. **Dynamic Learnable Prompt Token Pool**: A dynamically learnable pool of prompt tokens is maintained, and the Hungarian algorithm is used to replace part of the original visual tokens with the most relevant prompt tokens.
2. **Mixed Attention Mechanism**: Visual tokens are treated as a "foreign language" for the LLM, and a mixed attention mechanism combines bidirectional attention over visual tokens with unidirectional attention over text tokens to improve the understanding of visual tokens.
3. **Detailed Caption Generation Task**: A detailed caption generation task is integrated so that rich descriptions further help the LLM understand visual semantic information.

With these components, the resulting model, Croc, is reported to achieve state-of-the-art performance on multiple vision-language benchmarks.

### Formula Summary

1. **Probability of Generating Context-Dependent Captions**:
   \[ p(T_c \mid T_v, T_t)=\prod_{i=1}^{L} p(c_i \mid T_v, T_{t,<i}, T_{c,<i}) \]
   where \(T_c\) is the original caption, \(T_v\) is the visual tokens, \(T_t\) is the text tokens, and \(L\) is the sequence length.
2. **Cost Function Minimized by the Hungarian Algorithm**:
   \[ \hat{\sigma}=\arg\min_{\sigma\in S_N}\sum_{i}\|\tilde{T}_i^v - T_{\sigma(i)}^p\|^2 \]
   where \(\tilde{T}^v\) is the set of masked visual tokens to be replaced, \(T^p\) is the prompt token pool, and \(\sigma\) is a permutation.
3. **Visual Token Reconstruction Loss**:
   \[ L_{VTR}=\sum_{i\in\Theta}\|T_i^{rv}-T_i^v\| \]
   where \(\Theta\) is the index set of replaced visual tokens.
4. **Detailed Caption Generation Loss**:
   \[ L_{DCG}=\sum_i\log p(t_i \mid \hat{T}_v, t_1,\dots, t_{i-1}) \]
5. **Overall Training Loss**:
   \[ L=\alpha L_{VTR}+L_{DCG} \]
   where \(\alpha\) is a weight that balances the two losses.

According to the authors, these components improve the LLM's ability to understand and reason about visual information.
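
The prompt token matching and the mixed attention described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. The function names `match_prompt_tokens` and `mixed_attention_mask` are hypothetical, the matching uses exhaustive search as a stand-in for the Hungarian algorithm (same optimum, but only practical for tiny inputs), and the mask assumes that visual tokens come first in the sequence and attend only to other visual tokens.

```python
from itertools import permutations


def match_prompt_tokens(masked_tokens, prompt_pool):
    """Assign a distinct pool entry to every masked visual token.

    Minimizes the summed squared L2 distance by brute force. This is a
    stand-in for the Hungarian algorithm: it solves the same assignment
    problem, but its cost grows factorially with the input size.
    """
    if len(masked_tokens) > len(prompt_pool):
        raise ValueError("pool must hold at least one entry per masked token")

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    best_cost, best_assignment = float("inf"), ()
    # Each candidate maps masked token i to pool index candidate[i].
    for candidate in permutations(range(len(prompt_pool)), len(masked_tokens)):
        cost = sum(
            sq_dist(tok, prompt_pool[j]) for tok, j in zip(masked_tokens, candidate)
        )
        if cost < best_cost:
            best_cost, best_assignment = cost, candidate
    return list(best_assignment), best_cost


def mixed_attention_mask(num_visual, num_text):
    """Return mask[q][k] == True when query position q may attend to key k.

    Assumed layout: visual tokens occupy positions [0, num_visual) and the
    text tokens follow them.
    """
    total = num_visual + num_text
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if q < num_visual:
                # Visual query: bidirectional over the visual block only.
                mask[q][k] = k < num_visual
            else:
                # Text query: causal, sees every position up to its own.
                mask[q][k] = k <= q
    return mask
```

For example, `match_prompt_tokens([[0.0, 0.0], [1.0, 1.0]], [[1.0, 0.9], [0.1, 0.0], [5.0, 5.0]])` pairs the first masked token with pool entry 1 and the second with pool entry 0, and `mixed_attention_mask(2, 2)` lets both visual positions see each other while each text position sees only itself and earlier positions. In practice a polynomial-time solver would replace the exhaustive search.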