Abstract: This study explores replacing Transformers in Visual Language Models (VLMs) with Mamba, a recent structured state space model (SSM) that demonstrates promising performance in sequence modeling. We test models up to 3B parameters under controlled conditions, showing that Mamba-based VLMs outperform Transformer-based VLMs in captioning, question answering, and reading comprehension. However, we find that Transformers achieve greater performance in visual grounding, and the performance gap widens with scale. We explore two hypotheses to explain this phenomenon: 1) the effect of task-agnostic visual encoding on the updates of the hidden states, and 2) the difficulty in performing visual grounding from the perspective of in-context multimodal retrieval. Our results indicate that a task-aware encoding yields minimal performance gains on grounding; however, Transformers significantly outperform Mamba at in-context multimodal retrieval. Overall, Mamba shows promising performance on tasks where the correct output relies on a summary of the image but struggles when retrieval of explicit information from the context is required.

What problem does this paper attempt to address?

This paper investigates what happens when the Transformer backbone of a Visual Language Model (VLM) is replaced with Mamba, a structured state space model (SSM). Specifically, it aims to answer three questions:

1. **Performance Comparison**: Can Mamba match or even outperform Transformer-based VLMs on vision-and-language tasks such as image captioning, visual question answering, and reading comprehension?
2. **Task Differences**: How do Mamba and Transformers differ across types of multimodal tasks? In particular, how large is the gap on tasks that require fine-grained information, such as visual grounding?
3. **Cause Analysis**: If performance differs, what explains it? In particular, why does Mamba do well on some tasks but fall behind Transformers on others?

To answer these questions, the authors:

- Trained and evaluated Mamba-VL and Pythia-VL (a Transformer-based VLM) at several parameter scales, from 790M to 2.8B.
- Tested the models on a range of multimodal tasks, both coarse-grained (such as image captioning and visual question answering) and fine-grained (such as visual grounding).
- Explored how task-agnostic visual encoding and in-context multimodal retrieval affect model performance.

Through these experiments, the authors found that:

- Mamba-VL performs well on image captioning, visual question answering, and reading comprehension, and even outperforms Pythia-VL at the same scale.
- On tasks that require precise retrieval of information from the context (such as visual grounding), Pythia-VL significantly outperforms Mamba-VL, and the gap widens as model scale increases.
- Of the two hypotheses proposed to explain this gap, the first finds little support: a task-aware visual encoding yields only minimal gains on grounding. The second is supported: Transformers significantly outperform Mamba at in-context multimodal retrieval.

In conclusion, through systematic experiments and analysis, the paper reveals the advantages and limitations of Mamba in multimodal tasks, giving future research a useful point of reference.
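The contrast between summary-style and retrieval-style tasks can be illustrated with a toy sketch. This is not Mamba, not Pythia, and not the paper's experimental setup: Mamba uses input-dependent (selective) state updates, whereas the recurrence below is a plain exponential moving average chosen only for illustration. The function names (`attention_lookup`, `recurrent_summary`, `one_hot`), the `sharpness` and `decay` parameters, and the data are all hypothetical. The sketch only shows why a model that keeps every context item addressable can return one specific item, while a single fixed-size state tends to hold a blend of them.

```python
import math

def attention_lookup(keys, values, query, sharpness=20.0):
    """Softmax attention over every stored (key, value) pair."""
    scores = [sharpness * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return sum(w / total * v for w, v in zip(weights, values))

def recurrent_summary(values, decay=0.9):
    """Fold the whole sequence into one scalar state (toy fixed-size recurrence)."""
    state = 0.0
    for v in values:
        state = decay * state + (1.0 - decay) * v
    return state

def one_hot(index, size):
    """Return a list of length `size` with 1.0 at `index` and 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(size)]

# Eight context items, each with a distinct key and a scalar payload.
values = [3.0, 7.0, 1.0, 9.0, 4.0, 6.0, 2.0, 8.0]
keys = [one_hot(i, len(values)) for i in range(len(values))]

target = 3  # ask for the payload stored at position 3
retrieved = attention_lookup(keys, values, one_hot(target, len(values)))
summary = recurrent_summary(values)

print(f"attention lookup at position {target}: {retrieved:.3f}")
print(f"single-state summary of the sequence: {summary:.3f}")
```

In this toy, the attention lookup returns a value very close to the payload at the queried position, while the single-state summary is a decayed blend of all payloads. That blend can serve a summary-style output, but the individual item cannot be read back from it.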

Shaking Up VLMs: Comparing Transformers and Structured State Space Models for Vision & Language Modeling
