VMamba: Visual State Space Model

Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, Yunfan Liu
2024-04-10
Abstract: Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) have long been the predominant backbone networks for visual representation learning. While ViTs have recently gained prominence over CNNs due to their superior fitting capabilities, their scalability is largely constrained by the quadratic complexity of attention computation. Inspired by the capability of Mamba in efficiently modeling long sequences, we propose VMamba, a generic vision backbone model aiming to reduce the computational complexity to linear while retaining ViTs' advantageous features. To enhance VMamba's adaptability in processing vision data, we introduce the Cross-Scan Module (CSM) to enable 1D selective scanning in 2D image space with global receptive fields. Additionally, we make further improvements in implementation details and architectural designs to enhance VMamba's performance and boost its inference speed. Extensive experimental results demonstrate VMamba's promising performance across various visual perception tasks, highlighting its pronounced advantages in input scaling efficiency compared to existing benchmark models. Source code is available at
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the high computational cost of Vision Transformers (ViTs) in visual representation learning. Although ViTs perform well on visual tasks, the cost of self-attention grows quadratically with the input size, which limits their scalability. Inspired by the linear complexity of State Space Models (SSMs), and of Mamba in particular, in long-sequence modeling, the paper introduces VMamba, a generic vision backbone that aims to reduce the computational complexity to linear while retaining the global receptive field and dynamic weighting of ViTs. To adapt 1D selective scanning to 2D image space, VMamba introduces the Cross-Scan Module (CSM), which preserves a global receptive field. The reported experiments show promising performance across various visual perception tasks, with a pronounced advantage over existing benchmark models in how efficiently the model scales with input size.
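The idea of running a 1D scan over a 2D feature map can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names `cross_scan` and `cross_merge` are hypothetical, the choice of four traversal orders (row-major, column-major, and their reverses) is an assumption about how such a module might unfold the image, and the selective scan itself is left out, so the example only shows how patches are serialized and then merged back into the 2D layout.

```python
import numpy as np


def cross_scan(x):
    """Unfold an (H, W, C) feature map into four 1D sequences of length H*W.

    Assumed traversal orders: row-major, column-major, and the reverse of each.
    Returns an array of shape (4, H*W, C).
    """
    h, w, c = x.shape
    row_major = x.reshape(h * w, c)
    col_major = x.transpose(1, 0, 2).reshape(h * w, c)
    return np.stack([row_major, col_major, row_major[::-1], col_major[::-1]])


def cross_merge(seqs, h, w):
    """Map each of the four sequences back to (H, W, C) and sum them."""
    c = seqs.shape[-1]
    row = seqs[0].reshape(h, w, c)
    col = seqs[1].reshape(w, h, c).transpose(1, 0, 2)
    row_rev = seqs[2][::-1].reshape(h, w, c)
    col_rev = seqs[3][::-1].reshape(w, h, c).transpose(1, 0, 2)
    return row + col + row_rev + col_rev


# A 2x3 single-channel map with values 0..5, so the scan order is easy to read.
x = np.arange(6).reshape(2, 3, 1)
seqs = cross_scan(x)
print(seqs[..., 0])

# With no per-sequence processing in between, merging returns 4 * x,
# because every position is visited exactly once by each of the four scans.
merged = cross_merge(seqs, 2, 3)
```

In a real model each of the four sequences would be processed by a 1D sequence model before merging. Because every sequence has length H*W and a scan costs time linear in sequence length, the total cost under this scheme grows linearly with the number of patches, in contrast to the quadratic growth of self-attention.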