Samba: Semantic Segmentation of Remotely Sensed Images with State Space Model

Qinfeng Zhu,Yuanzhi Cai,Yuan Fang,Yihan Yang,Cheng Chen,Lei Fan,Anh Nguyen
2024-04-11
Abstract:High-resolution remotely sensed images pose a challenge for commonly used semantic segmentation methods such as Convolutional Neural Network (CNN) and Vision Transformer (ViT). CNN-based methods struggle with handling such high-resolution images due to their limited receptive field, while ViT faces challenges in handling long sequences. Inspired by Mamba, which adopts a State Space Model (SSM) to efficiently capture global semantic information, we propose a semantic segmentation framework for high-resolution remotely sensed images, named Samba. Samba utilizes an encoder-decoder architecture, with Samba blocks serving as the encoder for efficient multi-level semantic information extraction, and UperNet functioning as the decoder. We evaluate Samba on the LoveDA, ISPRS Vaihingen, and ISPRS Potsdam datasets, comparing its performance against top-performing CNN and ViT methods. The results reveal that Samba achieved unparalleled performance on commonly used remote sensing datasets for semantic segmentation. Our proposed Samba demonstrates for the first time the effectiveness of SSM in semantic segmentation of remotely sensed images, setting a new benchmark in performance for Mamba-based techniques in this specific application. The source code and baseline implementations are available at
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper attempts to address the problem of semantic segmentation of high-resolution remote sensing images. Specifically: 1. **Limitations of existing methods**: - Methods based on Convolutional Neural Networks (CNN) are limited by their finite receptive field when processing high-resolution images. - Methods based on Vision Transformers (ViT) have a global attention mechanism but face exponentially increasing computational complexity when handling long sequences and require a large amount of training data. 2. **Proposed new method**: - The paper proposes a semantic segmentation framework named Samba, which utilizes the State Space Model (SSM) to efficiently capture global semantic information. - Samba adopts an encoder-decoder architecture, where the encoder part consists of Samba blocks, and the decoder uses UperNet. 3. **Experimental results**: - On several commonly used remote sensing image datasets (such as LoveDA, ISPRS Vaihingen, and ISPRS Potsdam), Samba significantly outperforms existing CNN and ViT methods, achieving the best performance in terms of mIoU metric. - Especially on the LoveDA dataset, Samba outperforms the best ViT method (Segformer) by 3.95% and the best CNN method (ConvNeXt) by 10.3%. 4. **Main contributions**: - For the first time, the Mamba architecture is applied to the task of semantic segmentation of remote sensing images. - Demonstrates the great potential of the Mamba architecture as a backbone network for remote sensing image semantic segmentation. - Establishes new performance benchmarks and provides directions for future research.