HRVMamba: High-Resolution Visual State Space Model for Dense Prediction

Hao Zhang,Yongqiang Ma,Wenqi Shao,Ping Luo,Nanning Zheng,Kaipeng Zhang
2024-10-04
Abstract:Recently, State Space Models (SSMs) with efficient hardware-aware designs, i.e., Mamba, have demonstrated significant potential in computer vision tasks due to their linear computational complexity with respect to token length and their global receptive field. However, Mamba's performance on dense prediction tasks, including human pose estimation and semantic segmentation, has been constrained by three key challenges: insufficient inductive bias, long-range forgetting, and low-resolution output representation. To address these challenges, we introduce the Dynamic Visual State Space (DVSS) block, which utilizes multi-scale convolutional kernels to extract local features across different scales and enhance inductive bias, and employs deformable convolution to mitigate the long-range forgetting problem while enabling adaptive spatial aggregation based on input and task-specific information. By leveraging the multi-resolution parallel design proposed in HRNet, we introduce High-Resolution Visual State Space Model (HRVMamba) based on the DVSS block, which preserves high-resolution representations throughout the entire process while promoting effective multi-scale feature learning. Extensive experiments highlight HRVMamba's impressive performance on dense prediction tasks, achieving competitive results against existing benchmark models without bells and whistles. Code is available at <a class="link-external link-https" href="https://github.com/zhanghao5201/HRVMamba" rel="external noopener nofollow">this https URL</a>.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper attempts to address the performance deficiencies of the visual Mamba model in dense prediction tasks. Specifically, existing visual Mamba models face three key challenges when handling dense prediction tasks such as human pose estimation and semantic segmentation: 1. **Insufficient Inductive Bias**: Existing models process images by segmenting them into a series of patches (or tokens) and constructing a global receptive field through bidirectional or four-directional scanning mechanisms. While this method effectively handles long sequences, it disrupts the natural 2D spatial dependencies of images and lacks the inductive bias necessary for effective local representation learning. 2. **Long-Distance Forgetting Problem**: The Mamba model, when processing tokens, leads to the decay of previous hidden states, resulting in long-distance forgetting. This may cause the model to lose high-level, task-specific features related to the query patch and focus more on low-level edge features. 3. **Low-Resolution Output Representation**: Current visual Mamba models typically generate single-scale, low-resolution features, leading to significant information loss and difficulty in capturing the fine-grained details and multi-scale variations required for dense prediction tasks. To address these issues, the paper introduces the Dynamic Visual State Space (DVSS) block and proposes the High-Resolution Visual State Space Model (HRVMamba) based on it. By combining multi-scale convolutional kernels and deformable convolutions, the DVSS block enhances the model's inductive bias, alleviates the long-distance forgetting problem, and maintains high-resolution representations through a multi-resolution parallel design, making the model more suitable for dense prediction tasks. Experimental results show that HRVMamba achieves performance comparable to or even better than existing benchmark models in tasks such as image classification, human pose estimation, and semantic segmentation.