ML-Mamba: Efficient Multi-Modal Large Language Model Utilizing Mamba-2

Wenjun Huang, Jiakai Pan, Jiahao Tang, Yanyu Ding, Yifei Xing, Yuhe Wang, Zhengzhuo Wang, Jianguo Hu
2024-08-21
Abstract: Multimodal Large Language Models (MLLMs) have attracted much attention for their versatility. However, traditional Transformer architectures incur significant overhead due to their quadratic computational complexity. To address this issue, we introduce ML-Mamba, a multimodal language model that uses the recent and efficient Mamba-2 model for inference. Mamba-2 is known for its linear scalability and fast processing of long sequences. We replace the Transformer-based backbone with a pre-trained Mamba-2 model and explore methods for integrating 2D visual selective scanning mechanisms into multimodal learning, while also trying various visual encoders and Mamba-2 model variants. Our extensive experiments on various multimodal benchmarks demonstrate the competitive performance of ML-Mamba and highlight the potential of state space models in multimodal tasks. Our main findings are as follows: (1) We empirically explore how to effectively apply the 2D vision selective scan mechanism to multimodal learning and propose a novel multimodal connector, the Mamba-2 Scan Connector (MSC), which enhances representational capabilities. (2) ML-Mamba achieves performance comparable to state-of-the-art methods such as TinyLLaVA and MobileVLM v2 through its linear sequential modeling, while offering faster inference speed. (3) Compared to multimodal models based on Mamba-1, the Mamba-2-based ML-Mamba exhibits superior inference performance and effectiveness.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
The paper aims to improve the computational efficiency and inference performance of multimodal large language models (MLLMs). Specifically:

1. **Computational Efficiency Issue**: The traditional Transformer architecture has quadratic computational complexity in sequence length, which makes long sequences expensive to process. The paper proposes ML-Mamba, a multimodal language model based on the Mamba-2 model, which aims to remove this bottleneck through linear scalability and fast processing of long sequences.
2. **Multimodal Task Performance**: Existing models are mostly based on the Transformer architecture, and their performance in multimodal tasks still needs improvement. ML-Mamba enhances representation capabilities by introducing mechanisms such as the Mamba-2 Scan Connector (MSC) and demonstrates performance comparable to or even better than existing advanced methods on multiple benchmarks.
3. **Integration of Visual Information**: Researchers have been exploring how to better combine visual information with textual information to address real-world challenges. ML-Mamba not only processes visual information efficiently but also achieves strong results in multiple multimodal tasks, particularly in overcoming visual illusions and judging spatial relationships.

In summary, the paper aims to improve the computational efficiency and task performance of multimodal large language models by introducing new architectures and techniques.
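The quadratic-versus-linear contrast in point 1 can be illustrated with a toy sketch. This is not the Mamba-2 algorithm or the authors' implementation. It only assumes a scalar linear recurrence `h_t = a * h_{t-1} + b * x_t`, with hypothetical names `attention_pair_count` and `linear_scan` and arbitrary coefficients `a` and `b`, to show why a recurrent scan takes one step per token while full self-attention scores every token pair.

```python
from typing import List, Tuple


def attention_pair_count(seq_len: int) -> int:
    """Count the query-key scores computed by full self-attention.

    Every token attends to every token, so the count grows as seq_len ** 2.
    """
    return seq_len * seq_len


def linear_scan(xs: List[float], a: float = 0.9, b: float = 0.1) -> Tuple[List[float], int]:
    """Run a toy scalar recurrence h_t = a * h_{t-1} + b * x_t.

    The coefficients a and b are illustrative assumptions. The state h has
    a fixed size, and the loop performs exactly one update per token, so
    the work grows linearly with the sequence length.
    """
    h = 0.0
    outputs = []
    steps = 0
    for x in xs:
        h = a * h + b * x  # constant-cost state update
        outputs.append(h)
        steps += 1
    return outputs, steps


if __name__ == "__main__":
    for n in (512, 2048, 8192):
        _, steps = linear_scan([1.0] * n)
        print(f"n={n}: attention pairs={attention_pair_count(n)}, scan steps={steps}")
```

Under these assumptions, quadrupling the sequence length multiplies the attention pair count by sixteen but the scan step count only by four, which is the scaling gap the paper targets.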