RS3Mamba: Visual State Space Model for Remote Sensing Images Semantic Segmentation

Xianping Ma, Xiaokang Zhang, Man-On Pun
2024-04-03
Abstract: Semantic segmentation of remote sensing images is a fundamental task in geoscience research. However, the widely used convolutional neural networks (CNNs) and Transformers both have significant shortcomings. The former are limited by their insufficient long-range modeling capabilities, while the latter are hampered by their computational complexity. Recently, a novel visual state space (VSS) model represented by Mamba has emerged, capable of modeling long-range relationships with linear computational complexity. In this work, we propose a novel dual-branch network named remote sensing images semantic segmentation Mamba (RS3Mamba) to incorporate this innovative technology into remote sensing tasks. Specifically, RS3Mamba utilizes VSS blocks to construct an auxiliary branch, providing additional global information to the convolution-based main branch. Moreover, considering the distinct characteristics of the two branches, we introduce a collaborative completion module (CCM) to enhance and fuse features from the dual encoder. Experimental results on two widely used datasets, ISPRS Vaihingen and LoveDA Urban, demonstrate the effectiveness and potential of the proposed RS3Mamba. To the best of our knowledge, this is the first vision Mamba specifically designed for remote sensing images semantic segmentation. The source code will be made available at
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the semantic segmentation of remote sensing images. The widely used Convolutional Neural Networks (CNNs) and Transformer models each have a significant drawback on this task:

1. **Limitations of CNNs**: CNNs are limited by their local receptive fields, which makes it difficult for them to capture global context. This is a problem for remote sensing images, which contain complex scenes and objects at widely varying scales.
2. **Computational Complexity of Transformers**: Although Transformers can model long-range dependencies, their high computational complexity hurts model efficiency and increases memory consumption.

To address these issues, the authors propose a dual-branch network architecture named RS3Mamba, which uses the Visual State Space (VSS) model to strengthen feature extraction. RS3Mamba consists of a convolution-based main branch and an auxiliary branch, built from VSS blocks, that supplies additional global information. Because the two branches produce features with different characteristics, a Collaborative Completion Module (CCM) is introduced to enhance and fuse the features from the two encoders. The reported experiments show that RS3Mamba outperforms existing CNN-based and Transformer-based methods on the ISPRS Vaihingen and LoveDA Urban datasets.
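
The contrast between quadratic-cost attention and linear-cost state space modeling can be made concrete with a toy example. The sketch below is not the paper's VSS block or Mamba itself (Mamba uses input-dependent parameters and scans 2-D feature maps along several directions). It is a minimal, scalar, time-invariant state space recurrence, written only to show how a single pass over the sequence can carry information across long ranges. The function names, the 512×512 image size, and the 16×16 patch size are assumptions chosen for illustration.

```python
from typing import List


def ssm_scan(x: List[float], a: float, b: float, c: float) -> List[float]:
    """Run a scalar discrete state space recurrence over a 1-D sequence.

    h_t = a * h_{t-1} + b * x_t   (state update)
    y_t = c * h_t                 (readout)

    One pass over the input, so the cost grows linearly with its length.
    """
    h = 0.0
    y = []
    for x_t in x:
        h = a * h + b * x_t  # the state summarizes all earlier inputs
        y.append(c * h)
    return y


def attention_pair_count(num_tokens: int) -> int:
    """Token pairs scored by full self-attention: quadratic in length."""
    return num_tokens * num_tokens


def scan_step_count(num_tokens: int) -> int:
    """Recurrence steps taken by the scan above: linear in length."""
    return num_tokens


if __name__ == "__main__":
    # An impulse at position 0 still influences later outputs,
    # decaying by the factor `a` at each step.
    impulse = [1.0, 0.0, 0.0, 0.0]
    print(ssm_scan(impulse, a=0.5, b=1.0, c=1.0))

    # Hypothetical setting: a 512x512 image cut into 16x16 patches.
    tokens = (512 // 16) ** 2
    print(tokens, attention_pair_count(tokens), scan_step_count(tokens))
```

In this toy setting the scan takes one step per token, whereas full self-attention scores every pair of tokens. That gap is the efficiency argument the summary makes for building the auxiliary branch from VSS blocks.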