Abstract: Recent advancements in State Space Models, notably Mamba, have demonstrated superior performance over the dominant Transformer models, particularly in reducing the computational complexity from quadratic to linear. Yet, difficulties in adapting Mamba from language to vision tasks arise due to the distinct characteristics of visual data, such as the spatial locality and adjacency within images and large variations in information granularity across visual tokens. Existing vision Mamba approaches either flatten tokens into sequences in a raster scan fashion, which breaks the local adjacency of images, or manually partition tokens into windows, which limits their long-range modeling and generalization capabilities. To address these limitations, we present a new vision Mamba model, coined QuadMamba, that effectively captures local dependencies of varying granularities via quadtree-based image partition and scan. Concretely, our lightweight quadtree-based scan module learns to preserve the 2D locality of spatial regions within learned window quadrants. The module estimates the locality score of each token from its features, before adaptively partitioning tokens into window quadrants. An omnidirectional window shifting scheme is also introduced to capture more intact and informative features across different local regions. To make the discretized quadtree partition end-to-end trainable, we further devise a sequence masking strategy based on Gumbel-Softmax and its straight-through gradient estimator. Extensive experiments demonstrate that QuadMamba achieves state-of-the-art performance in various vision tasks, including image classification, object detection, instance segmentation, and semantic segmentation. The code is available at https://github.com/VISION-SJTU/QuadMamba.
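
The abstract's claim that raster-scan flattening breaks local adjacency can be illustrated with a small sketch. It is not taken from the paper: the helper names (`raster_order`, `window_order`, `window_spans`) and the "span" metric are assumptions introduced here for illustration. For each local window, the span is how many sequence positions lie between its first and last token; a smaller span means the window's tokens stay closer together in the flattened sequence.

```python
def raster_order(h, w):
    """Row-major (raster) scan over an h x w token grid."""
    return [(r, c) for r in range(h) for c in range(w)]


def window_order(h, w, win):
    """Scan window by window; tokens inside each win x win window are row-major."""
    order = []
    for r0 in range(0, h, win):
        for c0 in range(0, w, win):
            for r in range(r0, min(r0 + win, h)):
                for c in range(c0, min(c0 + win, w)):
                    order.append((r, c))
    return order


def window_spans(order, h, w, win):
    """For each win x win window, the number of sequence positions between
    its first and last token (inclusive) under the given scan order."""
    pos = {rc: i for i, rc in enumerate(order)}
    spans = []
    for r0 in range(0, h, win):
        for c0 in range(0, w, win):
            idx = [
                pos[(r, c)]
                for r in range(r0, min(r0 + win, h))
                for c in range(c0, min(c0 + win, w))
            ]
            spans.append(max(idx) - min(idx) + 1)
    return spans


# Illustrative sizes only: an 8 x 8 token grid with 4 x 4 windows.
print(window_spans(raster_order(8, 8), 8, 8, 4))     # [28, 28, 28, 28]
print(window_spans(window_order(8, 8, 4), 8, 8, 4))  # [16, 16, 16, 16]
```

Under the raster scan, the 16 tokens of each 4 x 4 window are spread over 28 sequence positions, while the windowed scan keeps them in 16 contiguous positions. The windowed scan here uses a fixed, hand-chosen window size, which is the limitation the abstract attributes to manual partitioning.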

What problem does this paper attempt to address?

The problem this paper attempts to address is how to use the Mamba model effectively in visual tasks, and in particular how to handle the spatial locality of images and the differences in information granularity across visual tokens, which existing methods handle poorly. Specifically, existing visual Mamba methods either flatten image tokens into sequences, which disrupts the local adjacency of images, or manually partition tokens into windows, which limits their long-range modeling and generalization capabilities. To overcome these limitations, the paper proposes a new visual Mamba model, QuadMamba, which captures local dependencies of different granularities through quadtree-based image partitioning and scanning.

### Main Contributions:

1. **Quadtree-based Scanning Module**: A lightweight quadtree scanning module that retains 2D locality within spatial regions and adaptively partitions tokens into window quadrants based on locality scores estimated from token features.
2. **Omnidirectional Window Shifting Scheme**: Introduces an omnidirectional window shifting scheme to capture more complete and informative features across different local regions.
3. **End-to-end Trainable Discretized Quadtree Partitioning**: Designs a sequence masking strategy based on Gumbel-Softmax and its straight-through gradient estimator, making the discretized quadtree partitioning end-to-end trainable.
4. **Extensive Experimental Validation**: Conducts extensive experiments on multiple visual tasks (image classification, object detection, instance segmentation, and semantic segmentation), reporting state-of-the-art performance for QuadMamba.

### Problems Addressed:

- **Preservation of Spatial Locality**: Through quadtree partitioning and scanning, QuadMamba better preserves the spatial locality of images, avoiding the loss of local information caused by flattening or fixed window partitioning in earlier methods.
- **Multi-granularity Modeling**: Through recursive partitioning and fine-grained scanning, QuadMamba can model local dependencies at different granularities, and so better handles multi-scale information in images.
- **Flexibility and Generalization Capability**: Compared to manual partitioning methods, QuadMamba's adaptive partitioning strategy is more flexible: it can better handle objects of different scales and ignore unimportant regions.

Overall, the paper reports that QuadMamba's quadtree scanning mechanism improves the performance and applicability of the Mamba model in visual tasks.
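
The quadtree scan and the Gumbel-Softmax selection described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a two-level quadtree, one hypothetical locality score per coarse quadrant, and a single quadrant chosen for refinement. The names `gumbel_softmax`, `quadtree_scan`, and `block_tokens` are introduced here. Only the forward pass is shown; in an autograd framework, the straight-through estimator would use the hard one-hot choice in the forward pass and route gradients through the soft probabilities.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Forward pass of Gumbel-Softmax: returns (soft probabilities, hard one-hot).
    Gradient handling (straight-through) needs an autograd framework and is not shown."""
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)          # avoid log(0)
        gumbel = -math.log(-math.log(u))      # sample from Gumbel(0, 1)
        noisy.append((logit + gumbel) / tau)
    top = max(noisy)
    exps = [math.exp(x - top) for x in noisy]  # numerically stable softmax
    total = sum(exps)
    soft = [e / total for e in exps]
    k = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == k else 0.0 for i in range(len(soft))]
    return soft, hard


def block_tokens(r0, c0, size):
    """Row-major (row, col) coordinates of a size x size block starting at (r0, c0)."""
    return [(r, c) for r in range(r0, r0 + size) for c in range(c0, c0 + size)]


def quadtree_scan(size, refine_index):
    """Scan order over a size x size token grid. The four coarse quadrants are
    visited in row-major order; the quadrant at refine_index is split into four
    sub-quadrants, each scanned contiguously (a finer locality granularity)."""
    assert size % 4 == 0, "grid side must be divisible by 4 for two levels"
    half, quarter = size // 2, size // 4
    corners = [(0, 0), (0, half), (half, 0), (half, half)]
    order = []
    for q, (r0, c0) in enumerate(corners):
        if q == refine_index:
            for dr in (0, quarter):
                for dc in (0, quarter):
                    order += block_tokens(r0 + dr, c0 + dc, quarter)
        else:
            order += block_tokens(r0, c0, half)
    return order


# Hypothetical per-quadrant locality scores; in the paper these are learned from features.
scores = [0.1, 2.5, 0.3, 0.2]
soft, hard = gumbel_softmax(scores, tau=1.0, rng=random.Random(0))
order = quadtree_scan(8, hard.index(1.0))
print(len(order))  # 64 tokens, each visited exactly once
```

Because the quadrant choice is a discrete argmax, it has no useful gradient on its own, which is why a relaxation such as Gumbel-Softmax is needed to train the score predictor end to end. The temperature `tau` controls how sharply the soft probabilities concentrate on the selected quadrant.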

QuadMamba: Learning Quadtree-based Selective Scan for Visual State Space Model
