MANet: Fine-Tuning Segment Anything Model for Multimodal Remote Sensing Semantic Segmentation

Xianping Ma, Xiaokang Zhang, Man-On Pun, Bo Huang
2024-10-15
Abstract: Multimodal remote sensing data, collected from a variety of sensors, provide a comprehensive and integrated perspective of the Earth's surface. By employing multimodal fusion techniques, semantic segmentation offers more detailed insights into geographic scenes compared to single-modality approaches. Building upon recent advancements in vision foundation models, particularly the Segment Anything Model (SAM), this study introduces a novel Multimodal Adapter-based Network (MANet) for multimodal remote sensing semantic segmentation. At the core of this approach is the development of a Multimodal Adapter (MMAdapter), which fine-tunes SAM's image encoder to effectively leverage the model's general knowledge for multimodal data. In addition, a pyramid-based Deep Fusion Module (DFM) is incorporated to further integrate high-level geographic features across multiple scales before decoding. This work not only introduces a novel network for multimodal fusion, but also demonstrates, for the first time, SAM's powerful generalization capabilities with Digital Surface Model (DSM) data. Experimental results on two well-established fine-resolution multimodal remote sensing datasets, ISPRS Vaihingen and ISPRS Potsdam, confirm that the proposed MANet significantly surpasses current models in the task of multimodal semantic segmentation. The source code for this work will be accessible at https://github.com/sstary/SSRS.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper asks how the strong generalization capabilities of the Segment Anything Model (SAM), a large vision foundation model, can be used to improve multimodal remote sensing semantic segmentation. Specifically, the researchers face the following challenges:

1. **Multimodal Data Fusion**: Multimodal remote sensing data (such as optical images, multispectral images, hyperspectral images, and LiDAR data) provide a comprehensive view of the Earth's surface. Effectively fusing these different modalities to improve the accuracy of semantic segmentation is a key issue.
2. **Model Adaptability**: Existing multimodal fusion methods typically require extensive task-specific training, which is both time-consuming and resource-intensive. Efficiently adapting large foundation models such as SAM to multimodal remote sensing tasks, while preserving their generalization capabilities, is an important research direction.
3. **Non-Optical Data Processing**: Combining non-optical data, such as Digital Surface Models (DSM), with optical data to further enhance model performance remains a challenge.

To address these issues, the authors propose a Multimodal Adapter (MMAdapter) and a Multimodal Adapter-based Network (MANet). MMAdapter fine-tunes SAM's image encoder so that its general knowledge carries over to multimodal data, and a pyramid-based Deep Fusion Module (DFM) integrates high-level geographic features across multiple scales before decoding. According to the reported experiments, MANet outperforms existing methods on the well-known multimodal remote sensing datasets ISPRS Vaihingen and ISPRS Potsdam.
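
The general idea of adapter-based fine-tuning mentioned above can be illustrated with a small, dependency-free sketch. It is not the authors' MMAdapter, whose internal design is not described on this page. It assumes a generic residual bottleneck adapter attached to a frozen encoder block, one adapter per modality (optical and DSM), and a simple weighted sum as the fusion step. All names (`BottleneckAdapter`, `frozen_block`, `fuse`, `adapted_block`) and the fusion weight are hypothetical.

```python
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def relu(x):
    """Element-wise ReLU."""
    return [max(0.0, v) for v in x]


class BottleneckAdapter:
    """Generic residual bottleneck adapter: x + up(relu(down(x))).

    Assumption: this is the common adapter pattern, not the paper's MMAdapter.
    Only `down` and `up` would be trained; the backbone stays frozen.
    """

    def __init__(self, dim, hidden, rng):
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                     for _ in range(hidden)]
        # Zero-initialised up-projection, so the adapter starts as the identity.
        self.up = [[0.0] * hidden for _ in range(dim)]

    def __call__(self, x):
        delta = matvec(self.up, relu(matvec(self.down, x)))
        return [a + b for a, b in zip(x, delta)]


def frozen_block(x):
    """Stand-in for one frozen encoder block (fixed, nothing trainable)."""
    return [2.0 * v for v in x]


def fuse(rgb_feat, dsm_feat, weight=0.5):
    """Hypothetical fusion: convex combination of the two modality features."""
    return [weight * r + (1.0 - weight) * d for r, d in zip(rgb_feat, dsm_feat)]


def adapted_block(x_rgb, x_dsm, adapter_rgb, adapter_dsm):
    """Run both modalities through the shared frozen block, adapt, then fuse."""
    h_rgb = adapter_rgb(frozen_block(x_rgb))
    h_dsm = adapter_dsm(frozen_block(x_dsm))
    return fuse(h_rgb, h_dsm)


rng = random.Random(0)
adapter_rgb = BottleneckAdapter(dim=2, hidden=1, rng=rng)
adapter_dsm = BottleneckAdapter(dim=2, hidden=1, rng=rng)
fused = adapted_block([1.0, 2.0], [3.0, 4.0], adapter_rgb, adapter_dsm)
```

Because the up-projection starts at zero, each adapter initially passes the frozen block's output through unchanged. That makes fine-tuning begin from the pretrained model's behaviour, which is one common reason for choosing this initialisation.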