Abstract: Due to the capability of dynamic state space models (SSMs) in capturing long-range dependencies with near-linear computational complexity, Mamba has shown notable performance in NLP tasks. This has inspired the rapid development of Mamba-based vision models, resulting in promising results in visual recognition tasks. However, such models are not capable of distilling features across layers through feature aggregation, interaction, and selection. Moreover, existing cross-layer feature aggregation methods designed for CNNs or ViTs are not practical in Mamba-based models due to high computational costs. Therefore, this paper aims to introduce an efficient cross-layer feature aggregation mechanism for Mamba-based vision backbone networks. Inspired by the Retinal Ganglion Cells (RGCs) in the human visual system, we propose a new sparse cross-layer connection mechanism termed SparX to effectively improve cross-layer feature interaction and reuse. Specifically, we build two different types of network layers: ganglion layers and normal layers. The former has higher connectivity and complexity, enabling multi-layer feature aggregation and interaction in an input-dependent manner. In contrast, the latter has lower connectivity and complexity. By interleaving these two types of layers, we design a new vision backbone network with sparsely cross-connected layers, achieving an excellent trade-off among model size, computational cost, memory cost, and accuracy in comparison to its counterparts. For instance, with fewer parameters, SparX-Mamba-T improves the top-1 accuracy of VMamba-T from 82.5% to 83.5%, while SparX-Swin-T achieves a 1.3% increase in top-1 accuracy compared to Swin-T. Extensive experimental results demonstrate that our new connection mechanism possesses both superior performance and generalization capabilities on various vision tasks.

What problem does this paper attempt to address?

The main problem this paper addresses is the limited ability of existing Mamba-based vision models to perform cross-layer feature aggregation, interaction, and selection. These models cannot effectively distill features across layers, so they do not fully exploit the capabilities of dynamic state space models (SSMs). Moreover, existing cross-layer feature aggregation methods are unsuitable for Mamba-based models because of their high computational cost.

### Background and Problem Description of the Paper

1. **Advantages and Limitations of the Mamba Model**
   - Mamba uses dynamic state space models (SSMs) and can capture long-range dependencies with near-linear computational complexity, so it performs well in natural language processing (NLP) tasks.
   - In vision tasks, however, Mamba-based models lack an effective cross-layer feature aggregation mechanism, which limits further performance gains.
2. **Limitations of Existing Cross-Layer Feature Aggregation Methods**
   - Traditional cross-layer feature aggregation methods (such as those in DenseNet and FcaFormer) are effective, but when applied to Mamba-based models they significantly increase computational cost, even reducing throughput by 50% and increasing GPU memory usage by more than 1 GB.

### Solution

To solve these problems, the paper proposes a new sparse cross-layer connection mechanism called SparX. Its design is inspired by the role of retinal ganglion cells (RGCs) in the human visual system, and it aims to improve cross-layer feature interaction and reuse in the following ways:

- **Two Types of Network Layers**: The network defines "ganglion layers" and "normal layers". Ganglion layers have higher connectivity and complexity and perform multi-layer feature aggregation and interaction in an input-dependent manner; normal layers have lower connectivity and complexity.
- **Dynamic Multi-layer Channel Aggregator (DMCA)**: A new module efficiently and selectively retrieves complementary features from previous layers and dynamically models multi-layer interactions.
- **Cross-layer Sliding Window**: To further improve computational efficiency, each ganglion layer connects only to the nearest several preceding ganglion layers, which reduces the number of earlier feature maps that must be stored and accessed.

### Experimental Results

The experiments show that SparX-Mamba achieves significant performance improvements on multiple vision tasks. For example:

- On ImageNet-1K classification, SparX-Mamba-T raises the top-1 accuracy of VMamba-T from 82.5% to 83.5%.
- On semantic segmentation, SparX-Mamba-T improves mIoU by 1.7% over VMamba-T.
- On object detection, SparX-Mamba-T improves AP^b by 1.3% over VMamba-T.

### Main Contributions

1. It proposes SparX, a new sparse cross-layer connection mechanism inspired by retinal ganglion cells, which dynamically configures cross-layer connections to promote information flow and feature distillation.
2. It builds two versatile vision backbone networks, SparX-Mamba and SparX-Swin, based on Mamba and Transformer respectively.
3. Through extensive experiments, it verifies that the new architecture strikes an excellent balance between performance and computational cost.

Through these innovations, SparX significantly improves the performance of Mamba-based vision models while keeping computational and memory overhead low.
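The sparse connection pattern described in the Solution section can be illustrated with a small sketch. It is not the authors' implementation: the function names, the `stride` parameter (which places a ganglion layer at every `stride`-th position), and the `window` parameter are hypothetical, and the sketch only models ganglion-to-ganglion links as the summary describes them. It compares the number of cross-layer connections against a DenseNet-style pattern in which every layer reads from all earlier layers.

```python
from typing import Dict, List


def build_sparse_connections(num_layers: int, stride: int, window: int) -> Dict[int, List[int]]:
    """Map each ganglion layer to the earlier ganglion layers it reads from.

    Assumptions (illustrative only): every `stride`-th layer is a ganglion
    layer, and each ganglion layer connects to at most `window` of the
    nearest preceding ganglion layers (the cross-layer sliding window).
    """
    ganglion = [i for i in range(num_layers) if (i + 1) % stride == 0]
    connections: Dict[int, List[int]] = {}
    for pos, layer in enumerate(ganglion):
        # Keep only the `window` most recent preceding ganglion layers.
        connections[layer] = ganglion[max(0, pos - window):pos]
    return connections


def dense_connections(num_layers: int) -> Dict[int, List[int]]:
    """DenseNet-style baseline: every layer reads from all earlier layers."""
    return {i: list(range(i)) for i in range(num_layers)}


def count_edges(connections: Dict[int, List[int]]) -> int:
    """Total number of cross-layer connections in a pattern."""
    return sum(len(sources) for sources in connections.values())


if __name__ == "__main__":
    sparse = build_sparse_connections(num_layers=12, stride=2, window=2)
    for layer, sources in sparse.items():
        print(f"ganglion layer {layer} reads from {sources}")
    print("sparse edges:", count_edges(sparse))
    print("dense edges: ", count_edges(dense_connections(12)))
```

Under these assumptions, no ganglion layer ever needs more than `window` earlier feature maps at once, which is the intuition behind the reduced storage and access cost mentioned above.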

SparX: A Sparse Cross-Layer Connection Mechanism for Hierarchical Vision Mamba and Transformer Networks
