VL-Mamba: Exploring State Space Models for Multimodal Learning

Yanyuan Qiao, Zheng Yu, Longteng Guo, Sihan Chen, Zijia Zhao, Mingzhen Sun, Qi Wu, Jing Liu

2024-03-20

Abstract: Multimodal large language models (MLLMs) have attracted widespread interest and have rich applications. However, the attention mechanism inherent in their Transformer structure has quadratic complexity in sequence length and results in expensive computational overhead. Therefore, in this work, we propose VL-Mamba, a multimodal large language model based on state space models, which have been shown to have great potential for long-sequence modeling with fast inference and linear scaling in sequence length. Specifically, we first replace the Transformer-based backbone language model, such as LLaMA or Vicuna, with the pre-trained Mamba language model. Then, we empirically explore how to effectively apply the 2D vision selective scan mechanism for multimodal learning, as well as combinations of different vision encoders and variants of pre-trained Mamba language models. Extensive experiments on diverse multimodal benchmarks show competitive performance, demonstrating the effectiveness of VL-Mamba and the potential of applying state space models to multimodal learning tasks.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper mainly addresses computational efficiency and long-sequence modeling in Multimodal Large Language Models (MLLMs). Transformer-based models suffer from high computational cost and large memory requirements because self-attention scales quadratically with sequence length. To address this, the paper proposes VL-Mamba, a multimodal large language model based on State Space Models (SSMs), which offer fast inference and linear scaling in sequence length for long-sequence modeling.

The core of VL-Mamba includes:

1. Using the pre-trained Mamba language model, instead of a Transformer-based model such as LLaMA or Vicuna, as the backbone language model.
2. Experimenting with a 2D vision selective scan mechanism to adapt it to multimodal learning, and designing a new architecture called the MultiModal Connector (MMC), which includes a Vision Selective Scan (VSS) module to enhance the modeling of 2D visual sequences.
3. Exploring different vision encoders, Mamba language model variants, and multimodal connectors to understand the impact of each component on the performance of VL-Mamba.

In extensive experiments on multiple multimodal benchmarks, VL-Mamba performs competitively with existing multimodal large language models and even outperforms larger models (such as the 7B and 13B versions of LLaVA-1.5) on certain tasks. The paper's stated contributions are introducing SSMs into multimodal learning tasks, proposing a new framework option, and open-sourcing the code to facilitate research in related fields.
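To make the idea of scanning a 2D patch grid with a linear-time recurrence more concrete, here is a minimal sketch in plain Python. It is not the authors' VSS module: the function names (`scan_orders`, `linear_recurrence`, `multi_direction_scan`), the choice of four scan directions, the fixed scalar parameters `a` and `b`, and the averaging used to merge directions are all assumptions made for illustration. In particular, Mamba's selective scan uses input-dependent parameters, whereas this toy recurrence uses constants.

```python
from typing import Dict, List, Sequence, Tuple

Grid = List[List[float]]
Order = List[Tuple[int, int]]


def scan_orders(height: int, width: int) -> Dict[str, Order]:
    """Return four 1D visiting orders over a height x width patch grid.

    Hypothetical helper: row-wise and column-wise traversals, each forward
    and reversed, so every patch can receive context from several directions.
    """
    row_major = [(r, c) for r in range(height) for c in range(width)]
    col_major = [(r, c) for c in range(width) for r in range(height)]
    return {
        "row_forward": row_major,
        "row_backward": row_major[::-1],
        "col_forward": col_major,
        "col_backward": col_major[::-1],
    }


def linear_recurrence(xs: Sequence[float], a: float = 0.5, b: float = 1.0) -> List[float]:
    """Toy scalar state-space scan: h_t = a * h_{t-1} + b * x_t, y_t = h_t.

    A single pass over the sequence, so the cost grows linearly with its
    length. The constants a and b are illustrative, not learned or selective.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(h)
    return ys


def multi_direction_scan(grid: Grid, a: float = 0.5, b: float = 1.0) -> Grid:
    """Run the toy recurrence along each scan order and average per patch."""
    height, width = len(grid), len(grid[0])
    out = [[0.0] * width for _ in range(height)]
    orders = scan_orders(height, width)
    for order in orders.values():
        ys = linear_recurrence([grid[r][c] for r, c in order], a, b)
        # Scatter each direction's outputs back to their 2D positions.
        for (r, c), y in zip(order, ys):
            out[r][c] += y / len(orders)
    return out


if __name__ == "__main__":
    patches = [[1.0, 0.0], [0.0, 0.0]]
    for row in multi_direction_scan(patches):
        print(row)
```

Each direction costs one pass over the H x W patches, so the total work stays linear in the number of patches, in contrast to the pairwise comparisons of self-attention.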

ML-Mamba: Efficient Multi-Modal Large Language Model Utilizing Mamba-2

Shaking Up VLMs: Comparing Transformers and Structured State Space Models for Vision & Language Modeling

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

DenseMamba: State Space Models with Dense Hidden Connection for Efficient Large Language Models

Mamba Fusion: Learning Actions Through Questioning

LocalMamba: Visual State Space Model with Windowed Selective Scan

LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via a Hybrid Architecture

V2M: Visual 2-Dimensional Mamba for Image Representation Learning

Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding

Speech-Mamba: Long-Context Speech Recognition with Selective State Spaces Models

MoVA: Adapting Mixture of Vision Experts to Multimodal Context

Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

VMamba: Visual State Space Model

VL-Meta: Vision-Language Models for Multimodal Meta-Learning

InfMLLM: A Unified Framework for Visual-Language Tasks

Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference

EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba

VSSD: Vision Mamba with Non-Causal State Space Duality