Rethinking Scanning Strategies with Vision Mamba in Semantic Segmentation of Remote Sensing Imagery: An Experimental Study

Qinfeng Zhu, Yuan Fang, Yuanzhi Cai, Cheng Chen, Lei Fan
2024-05-14
Abstract: Deep learning methods, especially Convolutional Neural Networks (CNN) and Vision Transformer (ViT), are frequently employed to perform semantic segmentation of high-resolution remotely sensed images. However, CNNs are constrained by their restricted receptive fields, while ViTs face challenges due to their quadratic complexity. Recently, the Mamba model, featuring linear complexity and a global receptive field, has gained extensive attention for vision tasks. In such tasks, images need to be serialized to form sequences compatible with the Mamba model. Numerous research efforts have explored scanning strategies to serialize images, aiming to enhance the Mamba model's understanding of images. However, the effectiveness of these scanning strategies remains uncertain. In this research, we conduct a comprehensive experimental investigation on the impact of mainstream scanning directions and their combinations on semantic segmentation of remotely sensed images. Through extensive experiments on the LoveDA, ISPRS Potsdam, and ISPRS Vaihingen datasets, we demonstrate that no single scanning strategy outperforms others, regardless of their complexity or the number of scanning directions involved. A simple, single scanning direction is deemed sufficient for semantic segmentation of high-resolution remotely sensed images. Relevant directions for future research are also recommended.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper investigates whether different scanning strategies significantly affect the performance of the Mamba model in semantic segmentation of high-resolution remote sensing images. Specifically, it experimentally studies the effects of mainstream scanning directions and their combinations on segmentation performance and assesses how effective these scanning strategies are.

### Background and Problem

1. **Background**:
   - Deep learning methods, especially Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), are commonly used for semantic segmentation of high-resolution remote sensing images.
   - CNNs have a limited receptive field, which makes it hard for them to capture long-range semantic dependencies in high-resolution images.
   - ViTs have a global receptive field, but their quadratic complexity makes processing high-resolution images costly.
   - Recently, the Mamba model has gained attention for combining linear complexity with a global receptive field, and it has been applied to vision tasks.
2. **Problem**:
   - The Mamba model operates on sequences, so images must be serialized into sequences before they can be processed.
   - Many studies have proposed different scanning strategies for this serialization, aiming to enhance the Mamba model's understanding of images.
   - However, the effectiveness of these scanning strategies has not been fully validated.

### Research Objectives

- To evaluate, through extensive experiments, the impact of mainstream scanning directions and their combinations on the semantic segmentation of high-resolution remote sensing images.
- To verify whether any specific scanning strategy significantly improves the segmentation performance of the Mamba model.

### Experimental Design

- Experiments were conducted on three datasets: LoveDA, ISPRS Potsdam, and ISPRS Vaihingen.
- 22 scanning strategies were designed: 12 individual scanning directions and 10 combinations of scanning directions.
- Segmentation performance was evaluated using the mIoU (mean Intersection over Union) metric.

### Main Findings

- The differences in segmentation performance between scanning strategies are minimal.
- A single scanning direction (e.g., D1) is already effective, and complex multi-directional scanning does not bring significant performance improvements.
- This indicates that, for semantic segmentation of high-resolution remote sensing images, the Mamba model is not sensitive to the choice of scanning strategy.

### Conclusion

- For semantic segmentation of high-resolution remote sensing images, a simple single-direction scanning strategy (e.g., D1) is effective.
- Because complex multi-directional scanning strategies do not significantly improve segmentation performance, a single-direction scan can be used instead. This reduces computational demands and allows deeper networks to be built with limited computational resources.

### Future Work

- Explore other methods to enhance the Mamba model's understanding of remote sensing images rather than relying solely on different scanning strategies.
- Further investigate the potential applications of the Mamba model in other visual tasks.
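
To make the serialization step concrete, here is a minimal NumPy sketch of how a grid of patch features could be scanned into a sequence, mapped back to the grid, and fused across several directions. It is an illustration under stated assumptions, not the authors' implementation: the function names (`scan_patches`, `unscan_patches`, `multi_direction`), the four direction names, and fusion by averaging are all hypothetical, and the direction names are not mapped to the paper's D1 to D12 labels. The `seq_model` argument is a placeholder for any sequence model such as a Mamba block.

```python
import numpy as np

# Hypothetical direction names for illustration only; they are not
# claimed to match the paper's D1..D12 labels.
DIRECTIONS = ("row", "row_rev", "col", "col_rev")


def scan_patches(feat: np.ndarray, direction: str = "row") -> np.ndarray:
    """Serialize an (H, W, C) patch grid into an (H*W, C) sequence."""
    h, w, c = feat.shape
    if direction in ("row", "row_rev"):
        # Left-to-right within a row, rows taken top-to-bottom.
        seq = feat.reshape(h * w, c)
    elif direction in ("col", "col_rev"):
        # Top-to-bottom within a column, columns taken left-to-right.
        seq = feat.transpose(1, 0, 2).reshape(h * w, c)
    else:
        raise ValueError(f"unknown direction: {direction}")
    # The "_rev" variants traverse the same path backwards.
    return seq[::-1] if direction.endswith("_rev") else seq


def unscan_patches(seq: np.ndarray, h: int, w: int, direction: str = "row") -> np.ndarray:
    """Invert scan_patches: map an (H*W, C) sequence back to (H, W, C)."""
    c = seq.shape[-1]
    if direction not in DIRECTIONS:
        raise ValueError(f"unknown direction: {direction}")
    if direction.endswith("_rev"):
        seq = seq[::-1]
    if direction.startswith("row"):
        return seq.reshape(h, w, c)
    return seq.reshape(w, h, c).transpose(1, 0, 2)


def multi_direction(feat: np.ndarray, directions, seq_model) -> np.ndarray:
    """Run seq_model on each scan, restore the grid layout, and average.

    Averaging is an assumed fusion rule chosen for simplicity.
    """
    h, w, _ = feat.shape
    outs = [
        unscan_patches(seq_model(scan_patches(feat, d)), h, w, d)
        for d in directions
    ]
    return np.mean(outs, axis=0)


if __name__ == "__main__":
    grid = np.arange(6, dtype=float).reshape(2, 3, 1)  # 2x3 grid, 1 channel
    for d in DIRECTIONS:
        print(d, scan_patches(grid, d)[:, 0].tolist())
    # Toy causal stand-in for a sequence model: a running sum over the sequence.
    fused = multi_direction(grid, DIRECTIONS, lambda s: np.cumsum(s, axis=0))
    print(fused.shape)
```

In this sketch a single-direction strategy calls `seq_model` once per layer, while a combination of k directions calls it k times, which is the extra cost the findings suggest can be avoided.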