Abstract: In the realm of multimodal research, numerous studies leverage substantial image-text pairs to conduct modal alignment learning, transforming Large Language Models (LLMs) into Multimodal LLMs (MLLMs) that excel in a variety of visual-language tasks. The prevailing methodologies fall primarily into two categories: self-attention-based and cross-attention-based methods. While self-attention-based methods offer superior data efficiency due to their simple MLP architecture, they often suffer from lower computational efficiency because they concatenate visual and textual tokens as input to the LLM. Conversely, cross-attention-based methods, although less data-efficient due to additional learnable parameters, exhibit higher computational efficiency by avoiding long input sequences for the LLM. To address these trade-offs, we introduce the Data-Efficient and Compute-Efficient Multimodal Large Language Model (EE-MLLM). Without introducing additional modules or learnable parameters, EE-MLLM achieves both data and compute efficiency. Specifically, we modify the original self-attention mechanism in the MLLM into a composite attention mechanism. This mechanism has two key characteristics: 1) eliminating the computational overhead of self-attention within visual tokens to achieve compute efficiency, and 2) reusing the weights of each LLM layer to facilitate effective modality alignment between vision and language for data efficiency. Experimental results demonstrate the effectiveness of EE-MLLM across a range of benchmarks, including general-purpose datasets like MMBench and SeedBench, as well as fine-grained tasks such as TextVQA and DocVQA.

What problem does this paper attempt to address?

This paper attempts to address the problem of balancing data efficiency and computational efficiency in Multimodal Large Language Models (MLLMs). Specifically, existing multimodal alignment methods have the following issues:

1. **Self-Attention Mechanism Method**:
   - **Advantages**: High data efficiency because the alignment module is simple and has few parameters, and vision and text are naturally aligned at each layer.
   - **Disadvantages**: Low computational efficiency because directly concatenating vision and text tokens increases the input sequence length, leading to a significant increase in computational cost, especially when processing high-resolution images.
2. **Cross-Attention Mechanism Method**:
   - **Advantages**: High computational efficiency because the input sequence length remains unchanged and does not increase with the number of vision tokens.
   - **Disadvantages**: Low data efficiency because a large amount of pre-training data is required to optimize the alignment between vision and text, increasing training complexity.

To overcome the limitations of these methods, the paper proposes **EE-MLLM** (Data-Efficient and Compute-Efficient Multimodal Large Language Model), which introduces a composite attention mechanism to achieve both data efficiency and computational efficiency. Specific improvements include:

1. **Eliminating Self-Attention Computation within Vision Tokens**: By removing self-attention computation within vision tokens, computational overhead is reduced, and the input sequence length remains the same as the text tokens.
2. **Reusing Weights of LLM Layers**: By reusing the weights of LLM layers as aligners, effective alignment between vision and text is promoted without introducing additional trainable parameters, thereby improving data efficiency.

With these improvements, EE-MLLM performs excellently on multiple benchmarks while significantly enhancing computational efficiency during the inference phase.
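The two ideas above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. All function names (`attention_pair_counts`, `project`, `composite_attention`) are hypothetical. The single-head layout, the causal masking over text tokens, and the choice to let every text token see all vision tokens are assumptions made for illustration. The summary only states that self-attention within vision tokens is removed and that the LLM's own weights are reused.

```python
import math


def attention_pair_counts(num_visual, num_text):
    """Query-key pairs scored per head and layer (causal masking ignored)."""
    full = (num_visual + num_text) ** 2              # concatenated self-attention
    composite = num_text * (num_visual + num_text)   # text queries only
    return full, composite


def project(tokens, weight):
    """Multiply each token vector by a weight matrix given as a list of rows."""
    return [[sum(t * w for t, w in zip(tok, col)) for col in zip(*weight)]
            for tok in tokens]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def composite_attention(visual, text, w_q, w_k, w_v):
    """Text tokens are the only queries; keys and values cover vision and text.

    The same w_k and w_v project both modalities (assumed reading of
    "reusing the weights of LLM layers"), so no new parameters are added.
    """
    queries = project(text, w_q)
    k_vis, v_vis = project(visual, w_k), project(visual, w_v)
    k_txt, v_txt = project(text, w_k), project(text, w_v)
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for i, query in enumerate(queries):
        # Assumption: all vision tokens are visible, text is causally masked.
        keys = k_vis + k_txt[: i + 1]
        values = v_vis + v_txt[: i + 1]
        scores = [scale * sum(a * b for a, b in zip(query, key)) for key in keys]
        weights = softmax(scores)
        outputs.append([sum(w * val[d] for w, val in zip(weights, values))
                        for d in range(len(values[0]))])
    return outputs


# Example sizes are arbitrary: 576 vision tokens and 64 text tokens.
full_pairs, composite_pairs = attention_pair_counts(576, 64)
print(full_pairs, composite_pairs)  # 409600 40960
```

The output of `composite_attention` has one vector per text token, so the sequence passed on to later layers stays at text length regardless of how many vision tokens the image produces.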

EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model

Efficient Multimodal Large Language Models: A Survey

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Demonstrative Instruction Following in Multimodal LLMs Via Integrating Low-Rank Adaptation with Ensemble Learning

E5-V: Universal Embeddings with Multimodal Large Language Models

InfMLLM: A Unified Framework for Visual-Language Tasks

NoteLLM-2: Multimodal Large Representation Models for Recommendation

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings

EMMA: Efficient Visual Alignment in Multi-Modal LLMs

MR-MLLM: Mutual Reinforcement of Multimodal Comprehension and Vision Perception

Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion

Skipping Computations in Multimodal LLMs

A Survey of Multimodal Large Language Model from A Data-centric Perspective

LMEye: An Interactive Perception Network for Large Language Models

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

Face-MLLM: A Large Face Perception Model

SEA: Supervised Embedding Alignment for Token-Level Visual-Textual Integration in MLLMs

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance

Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts