Abstract: Recently, State Space Models (SSMs) with efficient hardware-aware designs, i.e., Mamba, have demonstrated significant potential in computer vision tasks due to their linear computational complexity with respect to token length and their global receptive field. However, Mamba's performance on dense prediction tasks, including human pose estimation and semantic segmentation, has been constrained by three key challenges: insufficient inductive bias, long-range forgetting, and low-resolution output representation. To address these challenges, we introduce the Dynamic Visual State Space (DVSS) block, which utilizes multi-scale convolutional kernels to extract local features across different scales and enhance inductive bias, and employs deformable convolution to mitigate the long-range forgetting problem while enabling adaptive spatial aggregation based on input and task-specific information. By leveraging the multi-resolution parallel design proposed in HRNet, we introduce High-Resolution Visual State Space Model (HRVMamba) based on the DVSS block, which preserves high-resolution representations throughout the entire process while promoting effective multi-scale feature learning. Extensive experiments highlight HRVMamba's impressive performance on dense prediction tasks, achieving competitive results against existing benchmark models without bells and whistles. Code is available at https://github.com/zhanghao5201/HRVMamba.
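To give a rough sense of what a multi-resolution parallel design keeps around, the sketch below counts the tokens each branch would hold. It assumes the four strides used by HRNet (4, 8, 16, 32) and a hypothetical 256×192 input; the abstract does not state HRVMamba's actual branch configuration, so the function name and all numbers are illustrative only.

```python
# Illustrative sketch, not the HRVMamba implementation.
# Assumption: four parallel branches at HRNet's strides 4, 8, 16 and 32.


def branch_token_counts(height, width, strides=(4, 8, 16, 32)):
    """Return (stride, h, w, tokens) for each parallel resolution branch."""
    counts = []
    for stride in strides:
        h, w = height // stride, width // stride
        counts.append((stride, h, w, h * w))
    return counts


if __name__ == "__main__":
    # Hypothetical input size of 256x192 pixels.
    for stride, h, w, tokens in branch_token_counts(256, 192):
        print(f"stride {stride:>2}: {h}x{w} feature map, {tokens} tokens")
```

Under these assumptions the highest-resolution branch carries 3072 tokens and the lowest only 48. A model whose cost grows linearly with token length can keep the high-resolution branch alive more cheaply than one with quadratic cost.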

What problem does this paper attempt to address?

The paper addresses the performance shortfall of visual Mamba models on dense prediction tasks such as human pose estimation and semantic segmentation. It identifies three key challenges:

1. **Insufficient Inductive Bias**: Existing models split an image into a sequence of patches (tokens) and build a global receptive field through bidirectional or four-directional scanning. This handles long sequences efficiently, but it disrupts the natural 2D spatial dependencies of images and lacks the inductive bias needed for effective local representation learning.
2. **Long-Range Forgetting**: As Mamba processes tokens sequentially, earlier hidden states decay, so information from distant tokens is gradually lost. The model may lose high-level, task-specific features related to the query patch and focus instead on low-level edge features.
3. **Low-Resolution Output Representation**: Current visual Mamba models typically produce single-scale, low-resolution features, which causes significant information loss and makes it hard to capture the fine-grained details and multi-scale variations that dense prediction requires.

To address these issues, the paper introduces the Dynamic Visual State Space (DVSS) block and builds the High-Resolution Visual State Space Model (HRVMamba) on it. The DVSS block uses multi-scale convolutional kernels to strengthen inductive bias and deformable convolution to alleviate long-range forgetting. HRVMamba adopts the multi-resolution parallel design proposed in HRNet to maintain high-resolution representations throughout the network, making the model better suited to dense prediction. According to the paper, HRVMamba achieves results comparable to or better than existing benchmark models on image classification, human pose estimation, and semantic segmentation.
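The scanning and forgetting challenges described above can be made concrete with a toy sketch. It assumes a plain row-by-row scan and a scalar recurrence h_t = decay * h_{t-1} + x_t with a fixed decay between 0 and 1. Real Mamba uses input-dependent, per-channel parameters and several scan directions, so the function names, grid size, and decay value here are hypothetical simplifications.

```python
# Toy illustration only; not the Mamba or HRVMamba implementation.


def row_major_index(row, col, width):
    """Position of patch (row, col) in a row-by-row (raster) scan."""
    return row * width + col


def scan_distance(patch_a, patch_b, width):
    """How far apart two patches end up in the flattened 1D sequence."""
    return abs(row_major_index(*patch_a, width) - row_major_index(*patch_b, width))


def first_token_weight(decay, steps):
    """Share of the first token left in the hidden state after `steps` updates.

    Assumes h_t = decay * h_{t-1} + x_t with x_0 = 1 and all later inputs
    set to 0, which isolates the contribution of the first token.
    """
    h = 1.0  # hidden state right after reading x_0 = 1
    for _ in range(steps):
        h = decay * h
    return h


if __name__ == "__main__":
    width = 14  # hypothetical 14x14 patch grid (196 tokens)
    # Horizontal neighbours stay adjacent in the sequence.
    print(scan_distance((3, 4), (3, 5), width))  # -> 1
    # Vertical neighbours are pushed a full row apart.
    print(scan_distance((3, 4), (4, 4), width))  # -> 14
    # The first token's share shrinks geometrically with distance.
    for steps in (10, 100, 196):
        print(steps, first_token_weight(0.95, steps))
```

In this simplified setting, patches that touch vertically in the image sit 14 positions apart in the sequence, and the first token's weight falls off as decay raised to the number of steps. This is the kind of locality loss and decay that the convolutional components of the DVSS block are meant to counteract.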

HRVMamba: High-Resolution Visual State Space Model for Dense Prediction

VMamba: Visual State Space Model

MHS-VM: Multi-Head Scanning in Parallel Subspaces for Vision Mamba

EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba

Multi-Scale VMamba: Hierarchy in Hierarchy Visual State Space Model

Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

OccMamba: Semantic Occupancy Prediction with State Space Models

Spatial-Mamba: Effective Visual State Space Models via Structure-Aware State Fusion

V2M: Visual 2-Dimensional Mamba for Image Representation Learning

GroupMamba: Parameter-Efficient and Accurate Group Visual State Space Model

QuadMamba: Learning Quadtree-based Selective Scan for Visual State Space Model

EfficientViM: Efficient Vision Mamba with Hidden State Mixer based State Space Duality

VMRNN: Integrating Vision Mamba and LSTM for Efficient and Accurate Spatiotemporal Forecasting

2DMamba: Efficient State Space Model for Image Representation with Applications on Giga-Pixel Whole Slide Image Classification

PoseMamba: Monocular 3D Human Pose Estimation with Bidirectional Global-Local Spatio-Temporal State Space Model

MobileMamba: Lightweight Multi-Receptive Visual Mamba Network

MambaOcc: Visual State Space Model for BEV-based Occupancy Prediction with Local Adaptive Reordering

PlainMamba: Improving Non-Hierarchical Mamba in Visual Recognition

Hi-Mamba: Hierarchical Mamba for Efficient Image Super-Resolution

RS-Mamba for Large Remote Sensing Image Dense Prediction