CM-UNet: Hybrid CNN-Mamba UNet for Remote Sensing Image Semantic Segmentation

Mushui Liu, Jun Dan, Ziqian Lu, Yunlong Yu, Yingming Li, Xi Li
2024-05-17
Abstract: Due to large image sizes and object variations, current CNN-based and Transformer-based approaches for remote sensing image semantic segmentation are either suboptimal at capturing long-range dependencies or limited by high computational complexity. In this paper, we propose CM-UNet, comprising a CNN-based encoder for extracting local image features and a Mamba-based decoder for aggregating and integrating global information, facilitating efficient semantic segmentation of remote sensing images. Specifically, a CSMamba block is introduced to build the core segmentation decoder; it employs channel and spatial attention as the gate activation condition of the vanilla Mamba to enhance feature interaction and global-local information fusion. Moreover, to further refine the output features from the CNN encoder, a Multi-Scale Attention Aggregation (MSAA) module is employed to merge features at different scales. By integrating the CSMamba block and MSAA module, CM-UNet effectively captures the long-range dependencies and multi-scale global contextual information of large-scale remote sensing images. Experimental results on three benchmarks indicate that the proposed CM-UNet outperforms existing methods on various performance metrics. The codes are available at
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses semantic segmentation of remote sensing images, where large image sizes and object variations make the task difficult. Existing CNN-based methods struggle to capture long-range dependencies, while Transformer-based methods incur high computational complexity. To this end, the paper proposes a new framework called CM-UNet, which combines a CNN encoder that extracts local image features with a Mamba-based decoder that aggregates and fuses global information, thereby achieving efficient semantic segmentation of remote sensing images. Specifically, CM-UNet introduces a CSMamba block as the core of the segmentation decoder; the block uses channel and spatial attention as the gate activation condition of the vanilla Mamba to enhance feature interaction and global-local information fusion. Additionally, to further refine the features output by the CNN encoder, a Multi-Scale Attention Aggregation (MSAA) module is employed to fuse features at different scales. By integrating the CSMamba block and the MSAA module, CM-UNet effectively captures long-range dependencies and multi-scale global contextual information in large-scale remote sensing images. Experimental results on three benchmark datasets indicate that CM-UNet outperforms existing methods.
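The gating idea behind the CSMamba block can be illustrated with a small sketch. This is not the authors' implementation: the summary only states that channel and spatial attention act as the gate of a Mamba branch, so everything below is an assumption made for illustration. In particular, `channel_attention` and `spatial_attention` are parameter-free stand-ins (a sigmoid over pooled averages) for what would normally be learned layers, and `sequence_mixer` is a fixed-decay linear recurrence standing in for the selective state-space scan, whose parameters in Mamba depend on the input. All function names are hypothetical.

```python
import math
from typing import List

# A feature map indexed as [channel][row][col].
Feature = List[List[List[float]]]


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def channel_attention(x: Feature) -> List[float]:
    """One weight per channel: sigmoid of the channel's global average.
    Assumption: a real block would use learned layers here."""
    weights = []
    for ch in x:
        total = sum(sum(row) for row in ch)
        weights.append(sigmoid(total / (len(ch) * len(ch[0]))))
    return weights


def spatial_attention(x: Feature) -> List[List[float]]:
    """One weight per pixel: sigmoid of the mean across channels.
    Assumption: a real block would use a learned convolution here."""
    c, h, w = len(x), len(x[0]), len(x[0][0])
    return [
        [sigmoid(sum(x[k][i][j] for k in range(c)) / c) for j in range(w)]
        for i in range(h)
    ]


def sequence_mixer(x: Feature, decay: float = 0.5) -> Feature:
    """Placeholder for the Mamba scan (NOT a selective SSM): the recurrence
    s_t = decay * s_{t-1} + x_t over each channel's row-major pixels."""
    out = []
    for ch in x:
        w = len(ch[0])
        state, flat = 0.0, []
        for row in ch:
            for v in row:
                state = decay * state + v
                flat.append(state)
        out.append([flat[i * w:(i + 1) * w] for i in range(len(ch))])
    return out


def gated_block(x: Feature) -> Feature:
    """Scale the mixed sequence by the product of both attention maps."""
    ca = channel_attention(x)
    sa = spatial_attention(x)
    mixed = sequence_mixer(x)
    return [
        [[mixed[k][i][j] * ca[k] * sa[i][j] for j in range(len(x[0][0]))]
         for i in range(len(x[0]))]
        for k in range(len(x))
    ]


# Usage: a 2-channel, 2x2 feature map keeps its shape after gating.
x = [[[1.0, 1.0], [1.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]]
out = gated_block(x)
print(len(out), len(out[0]), len(out[0][0]))
```

Because both attention weights lie strictly between 0 and 1, the gate can only attenuate the mixed features. The recurrence carries context along the scan, while the gate decides, per channel and per pixel, how much of that context passes through.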