Abstract: Selective state space models (SSMs), such as Mamba, excel at capturing long-range dependencies in 1D sequential data, while their applications to 2D vision tasks still face challenges. Current visual SSMs often convert images into 1D sequences and employ various scanning patterns to incorporate local spatial dependencies. However, these methods struggle to capture complex spatial structures in images, and their lengthened scanning paths increase computational cost. To address these limitations, we propose Spatial-Mamba, a novel approach that establishes neighborhood connectivity directly in the state space. Instead of relying solely on sequential state transitions, we introduce a structure-aware state fusion equation, which leverages dilated convolutions to capture image spatial structural dependencies, significantly enhancing the flow of visual contextual information. Spatial-Mamba proceeds in three stages: initial state computation in a unidirectional scan, spatial context acquisition through structure-aware state fusion, and final state computation using the observation equation. Our theoretical analysis shows that Spatial-Mamba unifies the original Mamba and linear attention under the same matrix multiplication framework, providing a deeper understanding of our method. Experimental results demonstrate that Spatial-Mamba, even with a single scan, attains or surpasses the state-of-the-art SSM-based models in image classification, detection and segmentation. Source code and trained models can be found at https://github.com/EdwardChasel/Spatial-Mamba.
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve
This paper aims to address the issues faced by existing Visual State Space Models (VSSMs) when handling 2D visual tasks. Specifically:
1. **Insufficient Spatial Structure Capture**: Existing VSSMs typically convert images into 1D sequences and use different scanning patterns to introduce local spatial dependencies. However, these methods have limitations in effectively capturing the complex spatial structures of images.
2. **Increased Computational Cost**: Due to the extended scanning paths, these methods lead to increased computational costs.
3. **Direction Sensitivity**: Existing scanning strategies (such as bidirectional scanning, continuous scanning, etc.) alter the spatial relationships between pixels, disrupting the inherent spatial context of the image.
To overcome these issues, the authors propose **Spatial-Mamba**, which establishes neighborhood connectivity directly in the state space through a structure-aware state fusion equation. The fusion step uses dilated convolutions over the computed states to capture the spatial structural dependencies of images, improving the flow of visual contextual information. The method proceeds in three stages: initial state computation in a single unidirectional scan, structure-aware state fusion, and final output computation with the observation equation.
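The three stages named in the abstract (unidirectional scan, structure-aware state fusion, observation equation) can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name `spatial_mamba_sketch`, the single input channel, the diagonal (elementwise) transition `A_bar`, the scalar skip weight `D`, the raster scan order, the 3×3 neighborhood, and the zero padding are all assumptions made for the example.

```python
import numpy as np

def spatial_mamba_sketch(u, A_bar, B_bar, C, D, alpha, dilations=(1,)):
    """Hypothetical three-stage sketch.

    u:      (H, W) single-channel input (assumption: one channel)
    A_bar:  (H, W, N) per-pixel diagonal state transition
    B_bar:  (H, W, N) per-pixel input projection
    C:      (H, W, N) per-pixel output projection
    D:      scalar skip weight
    alpha:  (len(dilations), 3, 3) fusion weights, one 3x3 kernel per dilation
    """
    H, W = u.shape
    N = A_bar.shape[-1]
    L = H * W

    # Stage 1: one unidirectional (raster-order) scan, x_t = A_t * x_{t-1} + B_t * u_t.
    u_flat = u.reshape(L)
    A_flat = A_bar.reshape(L, N)
    B_flat = B_bar.reshape(L, N)
    x = np.zeros((L, N))
    prev = np.zeros(N)
    for t in range(L):
        prev = A_flat[t] * prev + B_flat[t] * u_flat[t]
        x[t] = prev
    x = x.reshape(H, W, N)

    # Stage 2: state fusion, a weighted sum of neighboring states on the 2D grid.
    # Each dilation d samples the 3x3 neighborhood at offsets {-d, 0, +d}.
    h = np.zeros_like(x)
    for d_idx, d in enumerate(dilations):
        padded = np.pad(x, ((d, d), (d, d), (0, 0)))  # zero padding at the borders
        for i in range(3):
            for j in range(3):
                h += alpha[d_idx, i, j] * padded[i * d:i * d + H, j * d:j * d + W]

    # Stage 3: observation equation applied to the fused states, y_t = C_t . h_t + D * u_t.
    return (C * h).sum(axis=-1) + D * u
```

With a single kernel whose only nonzero weight is a 1 at the center, stage 2 becomes the identity and the sketch reduces to a plain single-scan selective SSM, which makes the fusion step easy to isolate when experimenting.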
### Main Contributions
1. **Structure-Aware State Fusion**: By introducing a structure-aware state fusion equation, Spatial-Mamba captures spatial dependencies in images more effectively.
2. **Unidirectional Scanning**: Unlike existing methods that require multiple scanning directions, Spatial-Mamba achieves efficient spatial information fusion with only one unidirectional scan.
3. **Theoretical Unification**: The authors theoretically unify Spatial-Mamba with the original Mamba and linear attention mechanisms through a matrix multiplication framework, providing a deeper understanding of the method.
4. **Experimental Validation**: Experimental results show that even with a single scan, Spatial-Mamba matches or surpasses state-of-the-art SSM-based models on fundamental vision tasks: image classification, detection, and segmentation.
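The matrix multiplication view mentioned in the theoretical unification item can be illustrated for the plain scan. The sketch below is not taken from the paper's code: the function names `scan_as_matrix` and `scan_recurrent` and the diagonal transition `A_bar` are assumptions. It builds a lower-triangular matrix `M` so that the recurrent scan output equals `M @ u`, and the check afterwards shows that setting every transition to 1 collapses `M` to a causally masked `C @ B_bar.T`, which has the form of unnormalized causal linear attention.

```python
import numpy as np

def scan_as_matrix(A_bar, B_bar, C):
    """Return M (L x L) with M[i, j] = sum_n C[i, n] * prod_{k=j+1..i} A_bar[k, n] * B_bar[j, n]
    for j <= i, and 0 above the diagonal."""
    L, N = A_bar.shape
    M = np.zeros((L, L))
    for j in range(L):
        decay = np.ones(N)  # running product of transitions from step j+1 to step i
        for i in range(j, L):
            if i > j:
                decay = decay * A_bar[i]
            M[i, j] = np.sum(C[i] * decay * B_bar[j])
    return M

def scan_recurrent(u, A_bar, B_bar, C):
    """Reference recurrence: x_t = A_t * x_{t-1} + B_t * u_t, y_t = C_t . x_t."""
    L, N = A_bar.shape
    x = np.zeros(N)
    y = np.zeros(L)
    for t in range(L):
        x = A_bar[t] * x + B_bar[t] * u[t]
        y[t] = C[t] @ x
    return y

# Small numerical check on random parameters (values are arbitrary).
rng = np.random.default_rng(0)
L, N = 6, 4
u = rng.normal(size=L)
A_bar = rng.uniform(0.1, 0.9, size=(L, N))
B_bar = rng.normal(size=(L, N))
C = rng.normal(size=(L, N))

M = scan_as_matrix(A_bar, B_bar, C)
matches_recurrence = np.allclose(M @ u, scan_recurrent(u, A_bar, B_bar, C))

# With all transitions equal to 1, M is the causal mask applied to C @ B_bar.T.
M_no_decay = scan_as_matrix(np.ones((L, N)), B_bar, C)
matches_linear_attention = np.allclose(M_no_decay, np.tril(C @ B_bar.T))
```

The quadratic-time construction is only for inspection; the recurrence computes the same outputs in time linear in the sequence length.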
### Experimental Results
1. **Image Classification**: On the ImageNet-1K dataset, Spatial-Mamba-T, Spatial-Mamba-S, and Spatial-Mamba-B achieved Top-1 accuracies of 83.5%, 84.6%, and 85.3%, respectively, which the paper reports as outperforming comparable CNN- and Transformer-based methods.
2. **Object Detection and Instance Segmentation**: On the COCO dataset, Spatial-Mamba performed well across training schedules. Under the 1× schedule, Spatial-Mamba-T achieved 47.6% box mAP and 42.9% mask mAP, surpassing the compared methods.
3. **Semantic Segmentation**: On the ADE20K dataset, Spatial-Mamba achieved high mIoU scores in both single-scale and multi-scale tests, demonstrating its effectiveness in semantic segmentation tasks.
In summary, Spatial-Mamba addresses the limitations of existing VSSMs on 2D visual tasks by introducing a structure-aware state fusion mechanism, improving both model performance and efficiency.