MANet: Fine-Tuning Segment Anything Model for Multimodal Remote Sensing Semantic Segmentation

Xianping Ma, Xiaokang Zhang, Man-On Pun, Bo Huang

2024-10-15

Abstract: Multimodal remote sensing data, collected from a variety of sensors, provide a comprehensive and integrated perspective of the Earth's surface. By employing multimodal fusion techniques, semantic segmentation offers more detailed insights into geographic scenes compared to single-modality approaches. Building upon recent advancements in vision foundation models, particularly the Segment Anything Model (SAM), this study introduces a novel Multimodal Adapter-based Network (MANet) for multimodal remote sensing semantic segmentation. At the core of this approach is the development of a Multimodal Adapter (MMAdapter), which fine-tunes SAM's image encoder to effectively leverage the model's general knowledge for multimodal data. In addition, a pyramid-based Deep Fusion Module (DFM) is incorporated to further integrate high-level geographic features across multiple scales before decoding. This work not only introduces a novel network for multimodal fusion, but also demonstrates, for the first time, SAM's powerful generalization capabilities with Digital Surface Model (DSM) data. Experimental results on two well-established fine-resolution multimodal remote sensing datasets, ISPRS Vaihingen and ISPRS Potsdam, confirm that the proposed MANet significantly surpasses current models in the task of multimodal semantic segmentation. The source code for this work will be accessible at https://github.com/sstary/SSRS.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper asks how the generalization capability of a large vision foundation model, the Segment Anything Model (SAM), can be used to improve multimodal remote sensing semantic segmentation. Specifically, the researchers face the following challenges:

1. **Multimodal Data Fusion**: Multimodal remote sensing data (such as optical images, multispectral images, hyperspectral images, and LiDAR data) provide a comprehensive view of the Earth's surface. Fusing these modalities effectively to improve the accuracy of semantic segmentation is a key issue.
2. **Model Adaptability**: Existing multimodal fusion methods typically require extensive task-specific training, which is both time-consuming and resource-intensive. Adapting a large foundation model such as SAM to multimodal remote sensing tasks efficiently, while preserving its generalization capability, is an important research direction.
3. **Non-Optical Data Processing**: Combining non-optical data, such as Digital Surface Models (DSM), with optical data to further improve model performance is also a challenge.

To address these issues, the authors propose a Multimodal Adapter (MMAdapter) and a Multimodal Adapter-based Network (MANet). The MMAdapter fine-tunes SAM's image encoder for multimodal data, and a pyramid-based Deep Fusion Module (DFM) integrates high-level geographic features across multiple scales before decoding. The reported experiments show that MANet outperforms existing methods on two well-established multimodal remote sensing datasets, ISPRS Vaihingen and ISPRS Potsdam.
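The adapter idea mentioned above can be illustrated with a minimal, generic sketch. It is not the authors' MMAdapter, whose internals are not described on this page. It assumes a standard residual bottleneck adapter (down-projection, ReLU, zero-initialized up-projection) placed after a frozen block, one adapter per modality, and a plain average as the fusion step. The names `frozen_block`, `BottleneckAdapter`, and `adapted_two_stream` are hypothetical and exist only for this example.

```python
import random


def matvec(weights, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def relu(vec):
    return [max(0.0, v) for v in vec]


def frozen_block(token):
    """Stand-in for a frozen pretrained encoder block (hypothetical)."""
    return [2.0 * v + 1.0 for v in token]


class BottleneckAdapter:
    """Generic residual adapter: x + up(relu(down(x))).

    Only `down` and `up` would be trained; the backbone stays frozen.
    """

    def __init__(self, dim, hidden, seed=0):
        rng = random.Random(seed)
        self.down = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                     for _ in range(hidden)]
        # Zero-initialized up-projection: the adapter starts as an identity
        # map, so the frozen model's behaviour is unchanged before training.
        self.up = [[0.0] * hidden for _ in range(dim)]

    def __call__(self, token):
        delta = matvec(self.up, relu(matvec(self.down, token)))
        return [a + b for a, b in zip(token, delta)]


def adapted_two_stream(rgb_token, dsm_token, adapter_rgb, adapter_dsm):
    """Run both modalities through the shared frozen block, then through
    their own adapters, and fuse by averaging (an assumed fusion rule)."""
    rgb = adapter_rgb(frozen_block(rgb_token))
    dsm = adapter_dsm(frozen_block(dsm_token))
    fused = [(a + b) / 2 for a, b in zip(rgb, dsm)]
    return rgb, dsm, fused


rgb_token = [1.0, -2.0, 0.5, 3.0]
dsm_token = [0.0, 1.0, 0.0, 1.0]
rgb_out, dsm_out, fused_out = adapted_two_stream(
    rgb_token, dsm_token,
    BottleneckAdapter(dim=4, hidden=2, seed=1),
    BottleneckAdapter(dim=4, hidden=2, seed=2),
)
print(fused_out)
```

Because the up-projection starts at zero, each adapted stream initially equals the frozen block's output, which is one common way to keep a pretrained model's general knowledge intact at the start of fine-tuning.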

Multiattention Network for Semantic Segmentation of Fine-Resolution Remote Sensing Images

Multi-Attention-Network for Semantic Segmentation of Fine Resolution Remote Sensing Images

A Crossmodal Multiscale Fusion Network for Semantic Segmentation of Remote Sensing Data

A Multilevel Multimodal Fusion Transformer for Remote Sensing Semantic Segmentation

MeSAM: Multiscale Enhanced Segment Anything Model for Optical Remote Sensing Images

MANet: a multi-level aggregation network for semantic segmentation of high-resolution remote sensing images

LMFNet: An Efficient Multimodal Fusion Approach for Semantic Segmentation in High-Resolution Remote Sensing

RSAM-Seg: A SAM-based Approach with Prior Knowledge Integration for Remote Sensing Image Semantic Segmentation

MDMASNet: A dual-task interactive semi-supervised remote sensing image segmentation method

FusionSAM: Latent Space driven Segment Anything Model for Multimodal Fusion and Segmentation

MultiSenseSeg: A Cost-Effective Unified Multimodal Semantic Segmentation Model for Remote Sensing

Encoder- and Decoder-Based Networks Using Multiscale Feature Fusion and Nonlocal Block for Remote Sensing Image Semantic Segmentation

Segment Anything with Multiple Modalities

Deep Multimodal Fusion Network for Semantic Segmentation Using Remote Sensing Image and LiDAR Data

SAM-Assisted Remote Sensing Imagery Semantic Segmentation with Object and Boundary Constraints

Semantic Segmentation of Urban Airborne LiDAR Point Clouds Based on Fusion Attention Mechanism and Multi-Scale Features

Semantic Attention and Structured Model for Weakly Supervised Instance Segmentation in Optical and SAR Remote Sensing Imagery

CDMANet: central difference mutual attention network for RGB-D semantic segmentation

AANet: Adaptive Attention Networks for Semantic Segmentation of High-Resolution Remote Sensing Imagery

A Synergistical Attention Model for Semantic Segmentation of Remote Sensing Images