UNetMamba: An Efficient UNet-Like Mamba for Semantic Segmentation of High-Resolution Remote Sensing Images

Enze Zhu, Zhan Chen, Dingkai Wang, Hanru Shi, Xiaoxuan Liu, Lei Wang
2024-10-21
Abstract: Semantic segmentation of high-resolution remote sensing images is vital in downstream applications such as land-cover mapping, urban planning and disaster assessment. Existing Transformer-based methods suffer from the constraint between accuracy and efficiency, while the recently proposed Mamba is renowned for being efficient. Therefore, to overcome the dilemma, we propose UNetMamba, a UNet-like semantic segmentation model based on Mamba. It incorporates a mamba segmentation decoder (MSD) that can efficiently decode the complex information within high-resolution images, and a local supervision module (LSM), which is train-only but can significantly enhance the perception of local contents. Extensive experiments demonstrate that UNetMamba outperforms the state-of-the-art methods with mIoU increased by 0.87% on LoveDA and 0.39% on ISPRS Vaihingen, while achieving high efficiency through the lightweight design, less memory footprint and reduced computational cost. The source code is available at https://github.com/EnzeZhu2001/UNetMamba.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the trade-off between accuracy and efficiency in semantic segmentation of high-resolution remote sensing images. Existing Transformer-based methods improve accuracy significantly, but their high computational complexity and large parameter counts make them inefficient on high-resolution inputs. The recently proposed Mamba model is efficient, but its performance on specific tasks has not been fully validated.

To resolve this, the authors propose UNetMamba, a UNet-like model based on Mamba. It has three main components:

1. **Encoder**: Uses the ResT backbone network to capture multi-scale feature maps through an Efficient Multi-head Self-Attention mechanism (EMSA).
2. **Mamba Segmentation Decoder (MSD)**: Applies the basic unit of Mamba (the VSS block) on the decoding side to decode complex information efficiently, with linear complexity.
3. **Local Supervision Module (LSM)**: Enhances the perception of local semantic information through two convolutional branches of different scales and an auxiliary loss function. It is used during training only.

Experimental results show that UNetMamba outperforms state-of-the-art methods on two high-resolution remote sensing datasets, raising mIoU by 0.87% on LoveDA and by 0.39% on ISPRS Vaihingen, while remaining lightweight, with a smaller memory footprint and lower computational cost.
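
The accuracy figures above are reported in mIoU (mean Intersection over Union). As a minimal sketch of how this metric is commonly computed, and not the authors' evaluation code, the snippet below builds a confusion matrix from flattened label lists and averages the per-class IoU. The function names `confusion_matrix` and `mean_iou` are illustrative, and the sketch assumes integer class labels with no ignored or background class, which real benchmarks may handle differently.

```python
from typing import List, Sequence


def confusion_matrix(pred: Sequence[int], target: Sequence[int],
                     num_classes: int) -> List[List[int]]:
    """Count pixels; rows index the ground-truth class, columns the prediction."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        matrix[t][p] += 1
    return matrix


def mean_iou(pred: Sequence[int], target: Sequence[int],
             num_classes: int) -> float:
    """Average IoU = TP / (TP + FP + FN) over classes that actually occur."""
    matrix = confusion_matrix(pred, target, num_classes)
    ious = []
    for c in range(num_classes):
        tp = matrix[c][c]
        fn = sum(matrix[c]) - tp                  # ground truth c, predicted other
        fp = sum(row[c] for row in matrix) - tp   # predicted c, ground truth other
        union = tp + fp + fn
        if union > 0:  # skip classes absent from both prediction and ground truth
            ious.append(tp / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six pixels, three classes (hypothetical labels, flattened).
target = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
print(round(mean_iou(pred, target, num_classes=3), 4))  # 0.5
```

In the toy example the per-class IoUs are 1/3, 2/3 and 1/2, so their mean is 0.5. In practice the confusion matrix is accumulated over every image in the test set before the per-class IoUs are averaged.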