VL-Mamba: Exploring State Space Models for Multimodal Learning

Yanyuan Qiao, Zheng Yu, Longteng Guo, Sihan Chen, Zijia Zhao, Mingzhen Sun, Qi Wu, Jing Liu
2024-03-20
Abstract: Multimodal large language models (MLLMs) have attracted widespread interest and have rich applications. However, the attention mechanism inherent in their Transformer architecture has quadratic complexity in sequence length and results in expensive computational overhead. Therefore, in this work, we propose VL-Mamba, a multimodal large language model based on state space models, which have been shown to have great potential for long-sequence modeling with fast inference and linear scaling in sequence length. Specifically, we first replace the Transformer-based backbone language model, such as LLaMA or Vicuna, with the pre-trained Mamba language model. Then, we empirically explore how to effectively apply the 2D vision selective scan mechanism for multimodal learning, as well as combinations of different vision encoders and variants of pre-trained Mamba language models. Extensive experiments on diverse multimodal benchmarks show competitive performance, demonstrating the effectiveness of the proposed VL-Mamba and the great potential of applying state space models to multimodal learning tasks.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses computational efficiency and long-sequence modeling in multimodal large language models (MLLMs). Transformer-based models incur high computational cost and large memory requirements because self-attention scales quadratically with sequence length. To address this, the paper proposes VL-Mamba, an MLLM built on state space models (SSMs), which offer fast inference and linear scaling in sequence length for long-sequence modeling. The core of VL-Mamba includes:
1. Using the pre-trained Mamba language model as the backbone language model in place of a Transformer-based model such as LLaMA or Vicuna.
2. Experimenting with a 2D vision selective scan mechanism for multimodal learning, and designing a new architecture called the MultiModal Connector (MMC), which includes a Vision Selective Scan (VSS) module to strengthen the modeling of 2D visual sequences.
3. Exploring combinations of different vision encoders, Mamba language model variants, and multimodal connectors to understand how each component affects the performance of VL-Mamba.
In extensive experiments on multiple multimodal benchmarks, VL-Mamba performs competitively with existing multimodal large language models and even outperforms larger models (such as the 7B and 13B versions of LLaVA-1.5) on certain tasks. The paper's contributions are introducing SSMs to multimodal learning tasks for the first time, proposing a new framework option, and open-sourcing the code to facilitate research in related fields.
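To make the idea of scanning a 2D visual sequence with a linear-time state space recurrence more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's VSS module: the parameters `a`, `b`, `c` are fixed rather than input-dependent (so the scan is not "selective" in Mamba's sense), each patch is reduced to a single scalar feature, and the four scan directions and the averaging step are choices made here for illustration. The names `ssm_scan`, `scan_orders`, and `scan_2d` are hypothetical.

```python
import numpy as np


def ssm_scan(x, a, b, c):
    """Run h_t = a * h_{t-1} + b * x_t and y_t = c . h_t over a 1D sequence.

    x: (T,) inputs; a, b, c: (N,) diagonal state parameters (fixed here,
    which is an assumption; Mamba makes them depend on the input).
    The loop visits each element once, so cost grows linearly with T.
    """
    h = np.zeros_like(a, dtype=float)
    y = np.empty(len(x), dtype=float)
    for t, x_t in enumerate(x):
        h = a * h + b * x_t
        y[t] = float(c @ h)
    return y


def scan_orders(height, width):
    """Return four flattening orders for an H x W patch grid (illustrative)."""
    idx = np.arange(height * width).reshape(height, width)
    row_major = idx.reshape(-1)
    col_major = idx.T.reshape(-1)
    return {
        "row_forward": row_major,
        "row_backward": row_major[::-1],
        "col_forward": col_major,
        "col_backward": col_major[::-1],
    }


def scan_2d(patches, a, b, c):
    """Scan a (H, W) grid of scalar patch features in every order.

    Each scan's outputs are written back to the original patch positions
    and the results are averaged (the averaging is an assumption).
    """
    height, width = patches.shape
    flat = patches.reshape(-1)
    out = np.zeros(height * width)
    orders = scan_orders(height, width)
    for order in orders.values():
        y = ssm_scan(flat[order], a, b, c)
        out[order] += y  # order is a permutation, so each index is hit once
    return (out / len(orders)).reshape(height, width)
```

Calling `scan_2d(np.random.rand(4, 4), a, b, c)` with length-`N` parameter arrays returns a 4 x 4 grid in which each patch has been mixed with context from all four scan directions. Doubling the number of patches doubles the work, in contrast to the quadratic growth of pairwise attention.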