PyramidMamba: Rethinking Pyramid Feature Fusion with Selective Space State Model for Semantic Segmentation of Remote Sensing Imagery

Libo Wang, Dongxu Li, Sijun Dong, Xiaoliang Meng, Xiaokang Zhang, Danfeng Hong
2024-06-16
Abstract: Semantic segmentation, as a basic tool for intelligent interpretation of remote sensing images, plays a vital role in many Earth Observation (EO) applications. Accurate semantic segmentation of remote sensing images remains a challenge due to complex spatial-temporal scenes and multi-scale geo-objects. Driven by the wave of deep learning (DL), CNN- and Transformer-based semantic segmentation methods have been explored widely, and both architectures have revealed the importance of multi-scale feature representation for strengthening the semantic information of geo-objects. However, multi-scale feature fusion in practice often suffers from semantic redundancy due to homogeneous semantic content in pyramid features. To handle this issue, we propose a novel Mamba-based segmentation network, namely PyramidMamba. Specifically, we design a plug-and-play decoder, which develops a dense spatial pyramid pooling (DSPP) to encode rich multi-scale semantic features and a pyramid fusion Mamba (PFM) to reduce semantic redundancy in multi-scale feature fusion. Comprehensive ablation experiments illustrate the effectiveness and superiority of the proposed method in enhancing multi-scale feature representation, as well as its great potential for real-time semantic segmentation. Moreover, our PyramidMamba yields state-of-the-art performance on three publicly available datasets, i.e., the OpenEarthMap (70.8% mIoU), ISPRS Vaihingen (84.8% mIoU) and Potsdam (88.0% mIoU) datasets. The code will be available at https://github.com/WangLibo1995/GeoSeg.
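The mIoU figures quoted in the abstract are mean intersection-over-union scores. As a minimal sketch of how such a score can be computed from a predicted and a reference label map, the NumPy snippet below builds a confusion matrix and averages the per-class IoU. The function name `mean_iou` and the toy arrays are illustrative assumptions, and dataset-specific conventions (for example, classes ignored during evaluation) are not modelled here.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that occur in pred or target (illustrative helper)."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    # Confusion matrix: rows are reference classes, columns are predicted classes.
    conf = np.bincount(target * num_classes + pred, minlength=num_classes ** 2)
    conf = conf.reshape(num_classes, num_classes)
    intersection = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - intersection
    valid = union > 0  # skip classes absent from both maps
    return float((intersection[valid] / union[valid]).mean())

# Toy 2x3 label maps with three classes (made-up data, not from the paper).
pred = np.array([[0, 0, 1], [1, 2, 2]])
target = np.array([[0, 1, 1], [1, 2, 0]])
score = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {score:.3f}")
```

Per-class IoU here is 1/3, 2/3 and 1/2, so the mean is 0.5. Averaging per class rather than per pixel keeps small classes from being swamped by large ones.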
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses how to aggregate multi-scale features effectively in semantic segmentation of remote sensing images, improving segmentation accuracy while reducing semantic redundancy in feature fusion. It identifies two problems:

1. **Challenges of Multi-Scale Feature Representation**: Existing methods based on Convolutional Neural Networks (CNNs) and Transformers both have limitations in handling multi-scale features. CNN methods tend to produce coarse segmentation because of their single, limited receptive field; Transformers capture global contextual information, but at high computational complexity and low efficiency.
2. **Redundancy Problem in Multi-Scale Feature Fusion**: Pyramid features contain a large amount of homogeneous semantic information, which degrades feature fusion and hurts the final segmentation performance.

To address these issues, the paper proposes PyramidMamba, a novel segmentation network based on the Mamba architecture. Its decoder combines a Dense Spatial Pyramid Pooling (DSPP) module, which encodes rich multi-scale semantic features, with a Pyramid Fusion Mamba (PFM) module, which reduces redundancy in feature fusion. The specific contributions are:

1. **Rethinking the Pyramid Feature Fusion Scheme**: Proposing a Mamba-based segmentation network (PyramidMamba) to improve multi-scale feature representation.
2. **Designing a Mamba-Based Decoder**: Applying dense spatial pooling to generate more fine-grained multi-scale context and using Mamba's selective mechanism to reduce homogeneous semantic information, while also demonstrating potential for building real-time semantic segmentation networks.
3. **Experimental Validation**: Conducting comprehensive experiments on three publicly available remote sensing semantic segmentation datasets, on which the abstract reports state-of-the-art accuracy: OpenEarthMap (70.8% mIoU), ISPRS Vaihingen (84.8% mIoU) and Potsdam (88.0% mIoU).
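To make the idea of pyramid pooling concrete, here is a minimal NumPy sketch of generic spatial pyramid pooling: a feature map is average-pooled onto several grid sizes, each pooled map is upsampled back to the input resolution, and the results are stacked along the channel axis. This is not the paper's DSPP or PFM. The bin sizes, the nearest-neighbour upsampling and all function names are assumptions chosen for illustration, and the learned layers a real decoder would use are omitted.

```python
import numpy as np

def adaptive_avg_pool(feat, bins):
    """Average-pool a (C, H, W) map onto a (C, bins, bins) grid."""
    c, h, w = feat.shape
    out = np.empty((c, bins, bins), dtype=feat.dtype)
    for i in range(bins):
        r0 = (i * h) // bins
        r1 = -(-((i + 1) * h) // bins)  # ceiling division
        for j in range(bins):
            c0 = (j * w) // bins
            c1 = -(-((j + 1) * w) // bins)
            out[:, i, j] = feat[:, r0:r1, c0:c1].mean(axis=(1, 2))
    return out

def upsample_nearest(feat, size):
    """Nearest-neighbour upsampling of a (C, h, w) map to (C, H, W)."""
    height, width = size
    rows = np.arange(height) * feat.shape[1] // height
    cols = np.arange(width) * feat.shape[2] // width
    return feat[:, rows][:, :, cols]

def pyramid_pool(feat, bin_sizes=(1, 2, 4, 8)):
    """Stack the input with pooled-and-upsampled copies at each scale.

    bin_sizes is an assumed setting, not a value taken from the paper.
    """
    branches = [feat]
    for bins in bin_sizes:
        pooled = adaptive_avg_pool(feat, bins)
        branches.append(upsample_nearest(pooled, feat.shape[1:]))
    return np.concatenate(branches, axis=0)

# Random stand-in for a backbone feature map: 4 channels, 16x16 pixels.
feat = np.random.default_rng(0).standard_normal((4, 16, 16)).astype(np.float32)
pyramid = pyramid_pool(feat)
print(pyramid.shape)  # 4 input channels + 4 branches of 4 channels each
```

Every branch in this sketch is a smoothed copy of the same input, so the stacked channels overlap heavily in content. That overlap is one way to picture the homogeneous semantic information the summary describes, which a selective fusion step would then need to filter.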