SparX: A Sparse Cross-Layer Connection Mechanism for Hierarchical Vision Mamba and Transformer Networks

Meng Lou, Yunxiang Fu, Yizhou Yu
2024-09-15
Abstract: Due to the capability of dynamic state space models (SSMs) in capturing long-range dependencies with near-linear computational complexity, Mamba has shown notable performance in NLP tasks. This has inspired the rapid development of Mamba-based vision models, resulting in promising results in visual recognition tasks. However, such models are not capable of distilling features across layers through feature aggregation, interaction, and selection. Moreover, existing cross-layer feature aggregation methods designed for CNNs or ViTs are not practical in Mamba-based models due to high computational costs. Therefore, this paper aims to introduce an efficient cross-layer feature aggregation mechanism for Mamba-based vision backbone networks. Inspired by the Retinal Ganglion Cells (RGCs) in the human visual system, we propose a new sparse cross-layer connection mechanism termed SparX to effectively improve cross-layer feature interaction and reuse. Specifically, we build two different types of network layers: ganglion layers and normal layers. The former has higher connectivity and complexity, enabling multi-layer feature aggregation and interaction in an input-dependent manner. In contrast, the latter has lower connectivity and complexity. By interleaving these two types of layers, we design a new vision backbone network with sparsely cross-connected layers, achieving an excellent trade-off among model size, computational cost, memory cost, and accuracy in comparison to its counterparts. For instance, with fewer parameters, SparX-Mamba-T improves the top-1 accuracy of VMamba-T from 82.5% to 83.5%, while SparX-Swin-T achieves a 1.3% increase in top-1 accuracy compared to Swin-T. Extensive experimental results demonstrate that our new connection mechanism possesses both superior performance and generalization capabilities on various vision tasks.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The main problem this paper addresses is the limited ability of existing Mamba-based vision models to perform cross-layer feature aggregation, interaction, and selection. Specifically, these models cannot effectively distill features from different levels, so they do not fully exploit the capabilities of dynamic state space models (SSMs). Moreover, existing cross-layer feature aggregation methods are impractical for Mamba-based models because of their high computational cost.

### Background and Problem Description

1. **Advantages and Limitations of the Mamba Model**
   - Mamba uses dynamic state space models (SSMs) to capture long-range dependencies with near-linear computational complexity, and therefore performs well in natural language processing (NLP) tasks.
   - In vision tasks, however, Mamba-based models lack an effective cross-layer feature aggregation mechanism, which limits their performance.
2. **Limitations of Existing Cross-Layer Feature Aggregation Methods**
   - Traditional cross-layer feature aggregation methods (such as those in DenseNet and FcaFormer) are effective, but when applied to Mamba-based models they significantly increase computational cost, even reducing throughput by 50% and increasing GPU memory usage by more than 1 GB.

### Solution

To solve these problems, the paper proposes a new sparse cross-layer connection mechanism called SparX. Its design is inspired by the role of retinal ganglion cells (RGCs) in the human visual system, and it aims to improve cross-layer feature interaction and reuse in the following ways:

- **Two Types of Network Layers**: The network defines "ganglion layers" and "normal layers". Ganglion layers have higher connectivity and complexity and perform multi-layer feature aggregation and interaction in an input-dependent manner; normal layers have lower connectivity and complexity.
- **Dynamic Multi-layer Channel Aggregator (DMCA)**: A new module that efficiently and selectively retrieves complementary features from previous layers and dynamically models multi-layer interactions.
- **Cross-layer Sliding Window**: To further improve computational efficiency, each ganglion layer connects only to its nearest few preceding ganglion layers, which reduces the number of earlier feature maps that must be stored and accessed.

### Experimental Results

The experimental results show that SparX-Mamba achieves significant performance improvements on multiple vision tasks. For example:

- On ImageNet-1K classification, SparX-Mamba-T raises top-1 accuracy from 82.5% (VMamba-T) to 83.5%.
- On semantic segmentation, SparX-Mamba-T improves mIoU by 1.7% over VMamba-T.
- On object detection, SparX-Mamba-T improves AP^b (box AP) by 1.3% over VMamba-T.

### Main Contributions

1. Propose SparX, a new sparse cross-layer connection mechanism inspired by retinal ganglion cells, which dynamically configures cross-layer connections to promote information flow and feature distillation.
2. Build two versatile vision backbone networks, SparX-Mamba and SparX-Swin, on top of Mamba and Transformer architectures respectively.
3. Verify through extensive experiments that the new architecture achieves an excellent balance between performance and computational cost.

Through these innovations, SparX significantly improves the performance of Mamba-based vision models while keeping computational and memory overhead low.
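
The sparse connection pattern and cross-layer sliding window described under "Solution" can be illustrated with a small connectivity sketch. This is not the authors' implementation: the function names, the 0-based layer indexing, the rule that every `stride`-th layer is a ganglion layer, and the `window` parameter are all assumptions made for illustration, and the exact connection rule in the paper may differ.

```python
from typing import Dict, List


def ganglion_indices(num_layers: int, stride: int) -> List[int]:
    """Return the 0-based indices of ganglion layers.

    Assumption (hypothetical): every `stride`-th layer is a ganglion layer,
    and all other layers are normal layers.
    """
    return [i for i in range(num_layers) if (i + 1) % stride == 0]


def sparse_connections(num_layers: int, stride: int, window: int) -> Dict[int, List[int]]:
    """Map each ganglion layer to the earlier ganglion layers it reads from.

    Assumption (hypothetical): the sliding window keeps only the `window`
    nearest preceding ganglion layers, so older feature maps can be dropped.
    """
    ganglion = ganglion_indices(num_layers, stride)
    connections: Dict[int, List[int]] = {}
    for pos, layer in enumerate(ganglion):
        # Slice out at most `window` ganglion layers immediately before this one.
        connections[layer] = ganglion[max(0, pos - window):pos]
    return connections


def dense_connection_count(num_layers: int) -> int:
    """Count cross-layer connections when every layer reads from all earlier layers."""
    return num_layers * (num_layers - 1) // 2


if __name__ == "__main__":
    conn = sparse_connections(num_layers=8, stride=2, window=2)
    for layer, sources in conn.items():
        print(f"ganglion layer {layer} reads from {sources}")
    sparse_total = sum(len(sources) for sources in conn.values())
    print(f"sparse connections: {sparse_total}, dense connections: {dense_connection_count(8)}")
```

Under these assumptions, an 8-layer network with `stride=2` and `window=2` yields the mapping `{1: [], 3: [1], 5: [1, 3], 7: [3, 5]}`, which is 5 cross-layer connections compared with 28 for a densely connected network of the same depth. At most `window` earlier feature maps need to be kept in memory at any time, which is the intuition behind the reduced storage and access cost.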