EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model

Feipeng Ma, Yizhou Zhou, Hebei Li, Zilong He, Siying Wu, Fengyun Rao, Yueyi Zhang, Xiaoyan Sun
2024-09-10
Abstract: In the realm of multimodal research, numerous studies leverage substantial image-text pairs to conduct modal alignment learning, transforming Large Language Models (LLMs) into Multimodal LLMs and excelling in a variety of visual-language tasks. The prevailing methodologies primarily fall into two categories: self-attention-based and cross-attention-based methods. While self-attention-based methods offer superior data efficiency due to their simple MLP architecture, they often suffer from lower computational efficiency due to concatenating visual and textual tokens as input for LLM. Conversely, cross-attention-based methods, although less data-efficient due to additional learnable parameters, exhibit higher computational efficiency by avoiding long sequence input for LLM. To address these trade-offs, we introduce the Data-Efficient and Compute-Efficient Multimodal Large Language Model (EE-MLLM). Without introducing additional modules or learnable parameters, EE-MLLM achieves both data and compute efficiency. Specifically, we modify the original self-attention mechanism in MLLM to a composite attention mechanism. This mechanism has two key characteristics: 1) Eliminating the computational overhead of self-attention within visual tokens to achieve compute efficiency, and 2) Reusing the weights on each layer of LLM to facilitate effective modality alignment between vision and language for data efficiency. Experimental results demonstrate the effectiveness of EE-MLLM across a range of benchmarks, including general-purpose datasets like MMBench and SeedBench, as well as fine-grained tasks such as TextVQA and DocVQA.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the problem of achieving both data efficiency and computational efficiency in Multimodal Large Language Models (MLLMs). Existing multimodal alignment methods fall into two categories, each with its own trade-off:

1. **Self-attention-based methods**
   - **Advantages**: High data efficiency, because the alignment module is simple and has few parameters, and vision and text are aligned at every layer of the LLM.
   - **Disadvantages**: Low computational efficiency, because concatenating vision and text tokens lengthens the input sequence and significantly raises the computational cost, especially for high-resolution images.
2. **Cross-attention-based methods**
   - **Advantages**: High computational efficiency, because the input sequence length does not grow with the number of vision tokens.
   - **Disadvantages**: Low data efficiency, because the additional learnable parameters require a large amount of pre-training data to align vision and text, which increases training complexity.

To overcome these limitations, the paper proposes **EE-MLLM** (Data-Efficient and Compute-Efficient Multimodal Large Language Model), which replaces the original self-attention with a composite attention mechanism to achieve both data efficiency and computational efficiency. The specific changes are:

1. **Eliminating self-attention computation within vision tokens**: Vision tokens no longer attend to one another, which reduces computational overhead and keeps the sequence processed by the LLM the same length as the text-token sequence.
2. **Reusing the weights of LLM layers**: The weights of each LLM layer serve as the aligner, which promotes effective alignment between vision and text without introducing additional trainable parameters and thereby improves data efficiency.

With these changes, EE-MLLM performs well on multiple benchmarks, including MMBench, SeedBench, TextVQA, and DocVQA, while improving computational efficiency during inference.
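
The composite attention idea summarized above can be illustrated with a minimal single-head NumPy sketch. It is not the authors' implementation: the function names, the single-head layout, and the causal mask over text tokens are assumptions made for illustration. The sketch assumes that text tokens supply the queries, that vision and text tokens both supply keys and values through the same projection matrices (standing in for the reused LLM weights), and that vision tokens never attend to each other. The helper `score_counts` compares how many attention scores are computed in each scheme.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax; masked entries (-inf) get weight 0."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def composite_attention(text, visual, w_q, w_k, w_v):
    """Hypothetical single-head composite attention.

    text:   (T, d) hidden states of text tokens
    visual: (V, d) hidden states of vision tokens
    w_q, w_k, w_v: (d, d) projections shared by both modalities
                   (assumption: these stand in for the reused LLM weights)
    Returns updated text states of shape (T, d).
    """
    n_text, n_vis = text.shape[0], visual.shape[0]

    # Queries come from text tokens only, so no vision-to-vision scores exist.
    q = text @ w_q
    # Keys and values cover vision tokens followed by text tokens.
    kv_in = np.concatenate([visual, text], axis=0)
    k = kv_in @ w_k
    v = kv_in @ w_v

    scores = q @ k.T / np.sqrt(q.shape[-1])  # shape (T, V + T)

    # Each text token sees every vision token and text tokens up to itself.
    causal = np.tril(np.ones((n_text, n_text), dtype=bool))
    see_all_visual = np.ones((n_text, n_vis), dtype=bool)
    mask = np.concatenate([see_all_visual, causal], axis=1)
    scores = np.where(mask, scores, -np.inf)

    return softmax(scores, axis=-1) @ v


def score_counts(n_text, n_vis):
    """Attention scores per head and layer: full self-attention vs composite."""
    full = (n_text + n_vis) ** 2          # every token attends to every token
    composite = n_text * (n_text + n_vis)  # only text tokens issue queries
    return full, composite
```

With illustrative sizes of 32 text tokens and 576 vision tokens, `score_counts(32, 576)` returns `(369664, 19456)`, so under these assumptions the composite scheme computes roughly one nineteenth of the attention scores of full self-attention over the concatenated sequence. The saving grows with the number of vision tokens, which matches the summary's remark about high-resolution images.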