Abstract: Recent advances in Large Language Models (LLMs) have catalyzed the development of Large Multimodal Models (LMMs). However, existing research primarily focuses on tuning language and image instructions, ignoring the critical pretraining phase where models learn to process textual and visual modalities jointly. In this paper, we propose a new pretraining paradigm for LMMs to enhance the visual comprehension capabilities of LLMs by introducing a novel cross-modal comprehension stage. Specifically, we design a dynamically learnable prompt token pool and employ the Hungarian algorithm to replace part of the original visual tokens with the most relevant prompt tokens. Then, we conceptualize visual tokens as analogous to a "foreign language" for the LLMs and propose a mixed attention mechanism with bidirectional visual attention and unidirectional textual attention to comprehensively enhance the understanding of visual tokens. Meanwhile, we integrate a detailed caption generation task, leveraging rich descriptions to further facilitate LLMs in understanding visual semantic information. After pretraining on 1.5 million publicly accessible data samples, we present a new foundation model called Croc. Experimental results demonstrate that Croc achieves new state-of-the-art performance on massive vision-language benchmarks. To support reproducibility and facilitate further research, we release the training code and pre-trained model weights at https://github.com/deepglint/Croc.
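The mixed attention mechanism mentioned in the abstract can be pictured as a boolean attention mask. The sketch below is only an illustration under stated assumptions, not the authors' implementation: it assumes the sequence is laid out as all visual tokens followed by all text tokens, that visual tokens attend to every visual token (bidirectional), and that text tokens attend to all visual tokens plus the text tokens up to and including themselves (unidirectional). The function name `mixed_attention_mask` is hypothetical.

```python
def mixed_attention_mask(num_visual, num_text):
    """Build a square boolean mask; mask[q][k] is True if query q may attend to key k.

    Assumed layout (not confirmed by the source): positions [0, num_visual) are
    visual tokens, positions [num_visual, num_visual + num_text) are text tokens.
    """
    n = num_visual + num_text
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if q < num_visual:
                # Visual query: bidirectional attention over all visual tokens.
                mask[q][k] = k < num_visual
            else:
                # Text query: causal attention, which covers every visual token
                # (they all precede the text) and earlier text tokens.
                mask[q][k] = k <= q
    return mask


if __name__ == "__main__":
    # Print a small mask: 1 = may attend, 0 = blocked.
    for row in mixed_attention_mask(num_visual=3, num_text=2):
        print(" ".join("1" if allowed else "0" for allowed in row))
```

Under these assumptions the visual block of the mask is fully dense while the text block stays lower-triangular, which is the contrast the "bidirectional visual, unidirectional textual" description points to.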

What problem does this paper attempt to address?

The paper addresses a gap in how existing large multimodal models (LMMs) learn to process text and visual modalities jointly during pre-training. Current research mainly focuses on tuning language and image instructions and neglects the pre-training stage, which is the crucial period in which the model learns to handle textual and visual information together. To address this, the authors propose a new pre-training paradigm aimed at enhancing the visual understanding ability of large language models (LLMs). They introduce a new cross-modal comprehension stage built on three components:

1. **Dynamic Learnable Prompt Token Pool**: Design a dynamically learnable prompt token pool and use the Hungarian algorithm to replace part of the original visual tokens with the most relevant prompt tokens.
2. **Mixed Attention Mechanism**: Treat visual tokens as a "foreign language" for the LLM and propose a mixed attention mechanism with bidirectional visual attention and unidirectional text attention to comprehensively enhance the understanding of visual tokens.
3. **Detailed Caption Generation Task**: Integrate a detailed caption generation task and use rich descriptions to further help the LLM understand visual semantic information.

With these components, the proposed Croc model achieves new state-of-the-art performance on multiple vision-language benchmarks.

### Formula Summary

1. **Probability of Generating a Context-Dependent Caption**:
\[ p(T_c \mid T_v, T_t) = \prod_{i=1}^{L} p(c_i \mid T_v, T_{t,<i}, T_{c,<i}) \]
where \(T_c\) is the original caption, \(c_i\) is its \(i\)-th token, \(T_v\) is the visual token, \(T_t\) is the text token, and \(L\) is the sequence length.

2. **Cost Function Minimized by the Hungarian Algorithm**:
\[ \hat{\sigma} = \arg\min_{\sigma \in S_N} \sum_{i} \left\| \tilde{T}_i^v - T_{\sigma(i)}^p \right\|^2 \]
where \(\tilde{T}^v\) is the set of masked visual tokens to be replaced, \(T^p\) is the prompt token pool, and \(\sigma\) is a permutation.

3. **Visual Token Reconstruction Loss**:
\[ L_{VTR} = \sum_{i \in \Theta} \left\| T_i^{rv} - T_i^v \right\| \]
where \(\Theta\) is the index set of replaced visual tokens, and \(T_i^{rv}\) and \(T_i^v\) are the reconstructed and original visual tokens.

4. **Detailed Caption Generation Loss**:
\[ L_{DCG} = -\sum_i \log p(t_i \mid \hat{T}_v, t_1, \dots, t_{i-1}) \]
written here as a negative log-likelihood so that minimizing it maximizes the probability of the caption tokens \(t_i\).

5. **Overall Training Loss**:
\[ L = \alpha L_{VTR} + L_{DCG} \]
where \(\alpha\) is a weight that balances the two losses.

Through these methods, the Croc model improves the LLM's ability to understand and reason about visual information.
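To make the matching objective and the combined loss in the formula summary concrete, here is a minimal sketch in plain Python. It is not the authors' code. It swaps the Hungarian algorithm for an exhaustive search over assignments, which finds the same minimum-cost matching but is only practical for a handful of tokens; a real pipeline would more likely use a dedicated solver such as `scipy.optimize.linear_sum_assignment`. The names `match_prompts` and `total_loss`, the toy vectors, and the use of a negative log-likelihood for the caption term are all assumptions made for illustration.

```python
import math
from itertools import permutations


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def match_prompts(masked_tokens, prompt_pool):
    """Assign one distinct prompt token to each masked visual token.

    Returns (sigma, cost) where sigma[i] is the pool index matched to masked
    token i and cost is the total squared distance. Exhaustive search stands in
    for the Hungarian algorithm here (same optimum, far worse scaling).
    """
    n = len(masked_tokens)
    best_sigma, best_cost = None, math.inf
    for sigma in permutations(range(len(prompt_pool)), n):
        cost = sum(sq_dist(masked_tokens[i], prompt_pool[j])
                   for i, j in enumerate(sigma))
        if cost < best_cost:
            best_sigma, best_cost = sigma, cost
    return best_sigma, best_cost


def total_loss(recon, target, replaced_idx, token_log_probs, alpha=1.0):
    """Combine reconstruction and caption losses as alpha * L_VTR + L_DCG.

    L_VTR sums the Euclidean distance between reconstructed and original
    visual tokens over the replaced indices. L_DCG is taken as the negative
    sum of caption-token log-probabilities (an assumed sign convention).
    """
    l_vtr = sum(math.sqrt(sq_dist(recon[i], target[i])) for i in replaced_idx)
    l_dcg = -sum(token_log_probs)
    return alpha * l_vtr + l_dcg


if __name__ == "__main__":
    masked = [[0.0, 0.0], [1.0, 1.0]]            # toy masked visual tokens
    pool = [[1.0, 0.9], [0.1, 0.0], [5.0, 5.0]]  # toy prompt token pool
    sigma, cost = match_prompts(masked, pool)
    print("assignment:", sigma, "cost:", cost)
```

In this toy setup the pool is larger than the number of masked tokens, so each masked token picks a distinct pool entry and the unused entry is simply left out of the assignment.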

Croc: Pretraining Large Multimodal Models with Cross-Modal Comprehension

Multimodal Pretraining from Monolingual to Multilingual

LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge

Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

Cognitive Visual-Language Mapper: Advancing Multimodal Comprehension with Enhanced Visual Knowledge Alignment

Multimodal LLM Enhanced Cross-lingual Cross-modal Retrieval

Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion

InfMLLM: A Unified Framework for Visual-Language Tasks

mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models

CMAL: A Novel Cross-Modal Associative Learning Framework for Vision-Language Pre-Training

CROME: Cross-Modal Adapters for Efficient Multimodal LLM

What Makes for Good Visual Tokenizers for Large Language Models?

Mini-Monkey: Multi-Scale Adaptive Cropping for Multimodal Large Language Models

Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages

CogVLM: Visual Expert for Pretrained Language Models

From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models

COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training

Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants

MR-MLLM: Mutual Reinforcement of Multimodal Comprehension and Vision Perception