MambaOut: Do We Really Need Mamba for Vision?

Weihao Yu, Xinchao Wang

2024-05-21

Abstract: Mamba, an architecture with an RNN-like token mixer based on the state space model (SSM), was recently introduced to address the quadratic complexity of the attention mechanism and was subsequently applied to vision tasks. Nevertheless, the performance of Mamba for vision is often underwhelming when compared with convolutional and attention-based models. In this paper, we delve into the essence of Mamba, and conceptually conclude that Mamba is ideally suited for tasks with long-sequence and autoregressive characteristics. For vision tasks, as image classification does not align with either characteristic, we hypothesize that Mamba is not necessary for this task. Detection and segmentation tasks are also not autoregressive, yet they adhere to the long-sequence characteristic, so we believe it is still worthwhile to explore Mamba's potential for these tasks. To empirically verify our hypotheses, we construct a series of models named MambaOut by stacking Mamba blocks while removing their core token mixer, the SSM. Experimental results strongly support our hypotheses. Specifically, our MambaOut model surpasses all visual Mamba models on ImageNet image classification, indicating that Mamba is indeed unnecessary for this task. As for detection and segmentation, MambaOut cannot match the performance of state-of-the-art visual Mamba models, demonstrating the potential of Mamba for long-sequence visual tasks. The code is available at
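To make "RNN-like token mixer" and "autoregressive characteristic" concrete, the following is a minimal sketch, not Mamba's actual selective SSM. It assumes a scalar linear recurrence h_t = a*h_{t-1} + b*x_t with fixed, hypothetical coefficients a and b, and contrasts it with a hypothetical fully-visible mixer (a plain mean over all tokens). The recurrence keeps a fixed-size state and each output depends only on the current and earlier tokens, which is the causal behaviour autoregressive tasks need and image classification does not.

```python
def linear_recurrence(tokens, a=0.9, b=1.0):
    """Causal, RNN-like mixing: h_t = a * h_{t-1} + b * x_t.

    The state h is a single float no matter how long the sequence is.
    The coefficients a and b are illustrative assumptions.
    """
    h = 0.0
    outputs = []
    for x in tokens:
        h = a * h + b * x  # only past and current tokens contribute
        outputs.append(h)
    return outputs


def mean_mixer(tokens):
    """Fully-visible mixing: every output sees every token."""
    mean = sum(tokens) / len(tokens)
    return [mean for _ in tokens]


tokens = [1.0, 2.0, 3.0, 4.0]
changed = [1.0, 2.0, 3.0, 100.0]  # only the last token differs

causal_a = linear_recurrence(tokens)
causal_b = linear_recurrence(changed)
full_a = mean_mixer(tokens)
full_b = mean_mixer(changed)

# Causal mixing: earlier outputs are unaffected by the later change.
print(causal_a[:3] == causal_b[:3])
# Fully-visible mixing: even the first output changes.
print(full_a[0] == full_b[0])
```

In the causal case, changing the last token leaves all earlier outputs untouched; in the fully-visible case every output shifts. An image classifier can look at the whole image at once, so it has no need for the causal restriction.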

Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

This paper examines whether the Mamba architecture is actually needed for computer vision tasks, specifically image classification, object detection, and semantic segmentation. Mamba is built around a State Space Model (SSM) token mixer and was designed to avoid the quadratic complexity of the attention mechanism in Transformers. The paper observes, however, that Mamba-based vision models often underperform convolutional and attention-based models. From a conceptual analysis, the authors conclude that Mamba is best suited to tasks with long-sequence and autoregressive characteristics, and they propose two hypotheses:

1. For image classification, the SSM is unnecessary, because the task is neither long-sequence nor autoregressive.
2. Detection and segmentation are not autoregressive either, but they do involve long sequences, so Mamba's potential for these tasks is still worth exploring.

To test these hypotheses, the authors construct a family of models named MambaOut by stacking Mamba blocks with the SSM removed, which leaves gated convolutional blocks. On ImageNet image classification, MambaOut surpasses all visual Mamba models, supporting the first hypothesis. On detection and segmentation, MambaOut falls short of state-of-the-art visual Mamba models, which suggests that the SSM may still be useful for long-sequence visual tasks and is consistent with the second hypothesis.

The main contributions are a conceptual analysis of the kinds of tasks Mamba suits, an examination of whether visual tasks have those characteristics, and the two hypotheses about the necessity of Mamba in visual recognition together with their experimental verification. As a simplified model, MambaOut can also serve as a foundation for future research on visual Mamba models.
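The "long-sequence" distinction between classification and detection or segmentation can be illustrated with a rough token count. The sketch below rests on assumptions that are not stated in this summary: a patch size of 16, a channel width of 384, input sizes of 224x224 for classification and 800x1280 for detection, and a rule of thumb that a sequence counts as long once the token count L exceeds 6D. That rule comes from the common estimate that a Transformer block costs about 24*D^2*L + 4*D*L^2 FLOPs, so the quadratic term dominates when L > 6D. All function names are hypothetical.

```python
def num_tokens(height, width, patch=16):
    """Number of patch tokens for an image (patch size is an assumption)."""
    return (height // patch) * (width // patch)


def long_sequence_threshold(channels):
    """Assumed rule of thumb: quadratic cost dominates when L > 6 * D."""
    return 6 * channels


def is_long_sequence(height, width, channels, patch=16):
    """True when the token count exceeds the assumed threshold."""
    return num_tokens(height, width, patch) > long_sequence_threshold(channels)


# Illustrative settings (assumptions, not taken from the text above).
channels = 384
classification_tokens = num_tokens(224, 224)
detection_tokens = num_tokens(800, 1280)

print(classification_tokens, detection_tokens, long_sequence_threshold(channels))
print(is_long_sequence(224, 224, channels))   # classification-sized input
print(is_long_sequence(800, 1280, channels))  # detection-sized input
```

Under these assumptions the classification input yields a few hundred tokens, far below the threshold, while the detection input yields several thousand, above it. This matches the qualitative claim that only detection and segmentation have the long-sequence characteristic.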

