MambaOut: Do We Really Need Mamba for Vision?

Weihao Yu, Xinchao Wang
2024-05-21
Abstract: Mamba, an architecture with an RNN-like token mixer based on the state space model (SSM), was recently introduced to address the quadratic complexity of the attention mechanism and was subsequently applied to vision tasks. Nevertheless, the performance of Mamba for vision is often underwhelming when compared with convolutional and attention-based models. In this paper, we delve into the essence of Mamba and conceptually conclude that Mamba is ideally suited for tasks with long-sequence and autoregressive characteristics. For vision tasks, as image classification does not align with either characteristic, we hypothesize that Mamba is not necessary for this task. Detection and segmentation tasks are also not autoregressive, yet they adhere to the long-sequence characteristic, so we believe it is still worthwhile to explore Mamba's potential for these tasks. To empirically verify our hypotheses, we construct a series of models named MambaOut by stacking Mamba blocks while removing their core token mixer, the SSM. Experimental results strongly support our hypotheses. Specifically, our MambaOut model surpasses all visual Mamba models on ImageNet image classification, indicating that Mamba is indeed unnecessary for this task. As for detection and segmentation, MambaOut cannot match the performance of state-of-the-art visual Mamba models, demonstrating the potential of Mamba for long-sequence visual tasks. The code is available at
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper examines whether the Mamba architecture is actually needed for computer vision tasks, specifically image classification, object detection, and semantic segmentation. Mamba is built on a state space model (SSM) and was designed to avoid the quadratic complexity of the attention mechanism in Transformers. In vision, however, Mamba-based models often underperform convolutional and attention-based models. From a conceptual analysis, the authors conclude that Mamba is best suited to tasks with long-sequence and autoregressive characteristics, and they propose two hypotheses:

1. Image classification involves neither long sequences nor autoregressive processing, so the SSM is unnecessary for it.
2. Detection and segmentation are not autoregressive either, but they do involve long sequences, so Mamba's potential for them is still worth exploring.

To test these hypotheses, the authors construct a family of models named MambaOut by stacking Mamba blocks with their core token mixer, the SSM, removed; what remains is a stack of gated convolutional blocks. On ImageNet image classification, MambaOut surpasses all visual Mamba models, which supports the first hypothesis. On detection and segmentation, MambaOut falls short of state-of-the-art visual Mamba models, which is consistent with the second hypothesis and suggests that the SSM may still be useful for long-sequence visual tasks.

The paper's main contributions are a conceptual analysis of which task types suit Mamba, an examination of whether visual tasks have those characteristics, and the two hypotheses about Mamba's necessity for visual recognition together with their experimental verification. As a simplified model, MambaOut can also serve as a baseline for future research on visual Mamba models.
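
The long-sequence argument above can be made concrete with a rough back-of-the-envelope check. The sketch below is illustrative only and rests on assumptions that are not stated in the text above: a commonly used FLOPs estimate for one Transformer block, 24·D²·L + 4·D·L² (L tokens, D channels), under which the quadratic attention term dominates once L > 6·D; a patch size of 16; a channel width of 384; and typical input resolutions for each task. The function names are hypothetical.

```python
# Illustrative sketch: when does a vision task count as "long-sequence"?
# ASSUMPTIONS (not from the text above): the FLOPs estimate, patch size,
# channel width, and image resolutions below are chosen for illustration.


def transformer_flops(tokens, channels):
    """Return (linear_term, quadratic_term) of a rough per-block FLOPs estimate.

    Assumed estimate: 24 * D^2 * L for projections and MLP (linear in L),
    plus 4 * D * L^2 for attention (quadratic in L).
    """
    linear = 24 * channels ** 2 * tokens
    quadratic = 4 * channels * tokens ** 2
    return linear, quadratic


def num_tokens(height, width, patch=16):
    """Number of non-overlapping patch tokens for an image of the given size."""
    return (height // patch) * (width // patch)


def is_long_sequence(tokens, channels):
    """True when the quadratic term exceeds the linear one, i.e. L > 6 * D."""
    return tokens > 6 * channels


CHANNELS = 384  # assumed width, roughly that of a small vision Transformer

# Assumed typical input resolutions (height, width) per task.
TASKS = {
    "classification": (224, 224),
    "detection": (800, 1280),
    "segmentation": (512, 2048),
}

for name, (h, w) in TASKS.items():
    tokens = num_tokens(h, w)
    print(name, tokens, is_long_sequence(tokens, CHANNELS))
# classification 196 False
# detection 4000 True
# segmentation 4096 True
```

Under these assumptions the threshold is 6 × 384 = 2304 tokens, so classification inputs sit well below it while detection and segmentation inputs exceed it, which matches the distinction the two hypotheses draw. A different channel width or resolution would shift the numbers, so this is a way to reason about the claim rather than a reproduction of the paper's analysis.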