ML-Mamba: Efficient Multi-Modal Large Language Model Utilizing Mamba-2

Wenjun Huang, Jiakai Pan, Jiahao Tang, Yanyu Ding, Yifei Xing, Yuhe Wang, Zhengzhuo Wang, Jianguo Hu

2024-08-21

Abstract: Multimodal Large Language Models (MLLMs) have attracted much attention for their versatility. However, traditional Transformer architectures incur significant overhead because of their quadratic computational complexity. To address this issue, we introduce ML-Mamba, a multimodal language model that uses the recent, efficient Mamba-2 model for inference. Mamba-2 is known for its linear scalability and fast processing of long sequences. We replace the Transformer-based backbone with a pre-trained Mamba-2 model and explore methods for integrating 2D visual selective scanning mechanisms into multimodal learning, while also trying various visual encoders and Mamba-2 model variants. Our extensive experiments on various multimodal benchmarks demonstrate the competitive performance of ML-Mamba and highlight the potential of state space models in multimodal tasks. Our contributions and findings are: (1) We empirically explore how to effectively apply the 2D vision selective scan mechanism to multimodal learning and propose a novel multimodal connector, the Mamba-2 Scan Connector (MSC), which enhances representational capabilities. (2) ML-Mamba achieves performance comparable to state-of-the-art methods such as TinyLLaVA and MobileVLM v2 through its linear sequential modeling, while offering faster inference speed. (3) Compared to multimodal models utilizing Mamba-1, the Mamba-2-based ML-Mamba exhibits superior inference performance and effectiveness.
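The abstract does not say which traversal orders the 2D selective scan or the MSC uses. As a rough illustration only, the sketch below enumerates four scan orders over a grid of image patches (row-wise and column-wise, each forward and backward), a choice often seen in 2D scanning work and assumed here rather than taken from the paper. The function name `scan_orders` and the dictionary keys are hypothetical.

```python
def scan_orders(height, width):
    """Return four illustrative 1D traversal orders for a height x width patch grid.

    Each order is a list of (row, col) positions. These four orders are an
    assumption for illustration, not the orders confirmed for ML-Mamba's MSC.
    """
    # Row-major: left to right within a row, rows top to bottom.
    row_major = [(r, c) for r in range(height) for c in range(width)]
    # Column-major: top to bottom within a column, columns left to right.
    col_major = [(r, c) for c in range(width) for r in range(height)]
    return {
        "row_forward": row_major,
        "row_backward": row_major[::-1],
        "col_forward": col_major,
        "col_backward": col_major[::-1],
    }


if __name__ == "__main__":
    for name, order in scan_orders(2, 3).items():
        print(name, order)
```

Each order visits every patch exactly once, so a sequence model that reads the patches in several orders sees the same tokens with different notions of "previous" and "next".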

Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language

What problem does this paper attempt to address?

The paper aims to improve the computational efficiency and inference performance of multimodal large language models (MLLMs). Specifically:

1. **Computational Efficiency**: The traditional Transformer architecture has quadratic computational complexity in sequence length, which makes long sequences expensive to process. The paper proposes ML-Mamba, a multimodal language model built on the Mamba-2 model, and aims to remove this bottleneck through linear scalability and fast processing of long sequences.

2. **Multimodal Task Performance**: Most existing models are based on the Transformer architecture, and their performance on multimodal tasks still has room for improvement. ML-Mamba enhances representational capabilities by introducing mechanisms such as the Mamba-2 Scan Connector (MSC) and demonstrates performance comparable to, or even better than, existing advanced methods on multiple benchmarks.

3. **Integration of Visual Information**: Researchers have been exploring how to better combine visual and textual information to address real-world challenges. ML-Mamba processes visual information efficiently and achieves strong results on multiple multimodal tasks, particularly in overcoming visual illusions and judging spatial relationships.

In summary, the paper aims to improve the computational efficiency and task performance of multimodal large language models by introducing a new architecture and new techniques.
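The difference between quadratic and linear scaling can be made concrete with a toy comparison. The sketch below is not Mamba-2 and not the paper's code: it counts the query-key score entries that full self-attention computes for a sequence, and runs a scalar linear recurrence of the form h_t = decay * h_{t-1} + gain * x_t that takes one step per token. The names `attention_pair_count` and `linear_scan`, and the fixed `decay` and `gain` values, are assumptions for illustration; in a selective state space model these coefficients would depend on the input.

```python
def attention_pair_count(seq_len):
    """Number of query-key score entries in full (unmasked) self-attention."""
    return seq_len * seq_len


def linear_scan(xs, decay=0.9, gain=0.1):
    """Toy scalar recurrence h_t = decay * h_{t-1} + gain * x_t.

    Returns the list of states and the number of update steps, which
    equals len(xs): the cost grows linearly with sequence length.
    """
    h = 0.0
    outputs = []
    steps = 0
    for x in xs:
        h = decay * h + gain * x  # constant work per token
        outputs.append(h)
        steps += 1
    return outputs, steps


if __name__ == "__main__":
    for n in (1024, 2048, 4096):
        _, steps = linear_scan([1.0] * n)
        print(n, attention_pair_count(n), steps)
```

Doubling the sequence length doubles the number of recurrence steps but quadruples the number of attention score entries, which is the gap the paper targets.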
