Abstract: Distributed rooftop photovoltaic systems hold immense potential for renewable energy generation, and accurate extraction of building roofs from high-resolution remote sensing imagery is crucial for their development. While current semantic segmentation methods primarily rely on single-modality optical images, Synthetic Aperture Radar (SAR) offers complementary ground information that can significantly enhance segmentation accuracy. However, the modality disparities arising from different imaging mechanisms make feature fusion between SAR and optical images challenging. Existing approaches rely on simplistic fusion methods to exploit the complementary information of each modality and ignore the correlative information between modalities during feature extraction, which results in insufficient integration of complementary information. To address these challenges, we introduce CMMSNet, a novel multi-modal fusion semantic segmentation network specifically designed for building roof extraction. The architecture of CMMSNet consists of three core modules: the feature extraction encoder module, the heterogeneous modality alignment module, and the modality fusion decoder module. First, CMMSNet employs two independent pyramid-structured encoders to extract feature pyramids at multiple scales from the SAR and optical images separately. This strategy is intended to capture multi-scale semantic context and to address the large variation in spatial scale among objects in remote sensing images. Furthermore, an Adaptive Feature Alignment Module (AFAM) is introduced to identify correlative information between the two modalities along the spatial dimension and to align the modal features accordingly. This process is crucial for facilitating cross-modal learning and enhancing the feature representation of each modality.
In addition, a Cross-Modal Multi-Scale Feature Fusion (CMMSFF) module is designed to effectively integrate multi-scale, multi-modal heterogeneous features from both modalities. This module employs a channel self-attention mechanism that adaptively fuses discriminative features by weighting each modality and selectively discarding irrelevant components, thus enhancing the selection and fusion of key channels within the multi-modal feature set. This approach allows us to harness the complementary information provided by SAR and optical images, enhancing overall segmentation performance. Comprehensive experiments on the DFC23 dataset demonstrate that the proposed CMMSNet outperforms existing mainstream semantic segmentation methods, both single-modal and multi-modal, in stability and effectiveness. This achievement sets a new benchmark for building rooftop extraction from multi-modal remote sensing images. Our findings highlight the importance of multi-modal data fusion in addressing real-world challenges in remote sensing image analysis and offer valuable insights for future research in this domain.
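
The idea of weighting each modality per channel before fusion can be illustrated with a minimal sketch. This is not the CMMSFF module itself: the abstract does not specify its layers, so the example assumes a parameter-free simplification in which each channel is squeezed by global average pooling and the two modality weights come from a softmax over those pooled values. The function names (`global_average_pool`, `softmax_pair`, `fuse_channelwise`) and the toy inputs are hypothetical; a learned channel attention would replace the pooled scores with outputs of trainable layers.

```python
import math


def global_average_pool(feature):
    """Squeeze each channel (an H x W grid) to its mean value."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]


def softmax_pair(a, b):
    """Numerically stable softmax over two scores."""
    m = max(a, b)
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb), eb / (ea + eb)


def fuse_channelwise(optical, sar):
    """Fuse two [C][H][W] feature maps using per-channel modality weights.

    Assumption: weights come from pooled activations only (no learned
    parameters), which stands in for a trained channel attention block.
    """
    opt_desc = global_average_pool(optical)
    sar_desc = global_average_pool(sar)
    fused, weights = [], []
    for c, (o_ch, s_ch) in enumerate(zip(optical, sar)):
        w_opt, w_sar = softmax_pair(opt_desc[c], sar_desc[c])
        weights.append((w_opt, w_sar))
        # Weighted sum of the two modalities, element by element.
        fused.append([[w_opt * o + w_sar * s for o, s in zip(o_row, s_row)]
                      for o_row, s_row in zip(o_ch, s_ch)])
    return fused, weights


# Toy inputs: 2 channels, 2 x 2 spatial grid per modality.
optical = [[[1.0, 1.0], [1.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]]
sar = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
fused, weights = fuse_channelwise(optical, sar)
```

In this toy case the first channel is identical in both modalities, so the two weights are equal, while the second channel is dominated by the SAR branch because its pooled response is larger. A trained module would learn when a strong response is informative rather than assuming it.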

Land Use Classification via Multi-Modal Complementary Feature Fusion and Context Information Enhancement for Optical and SAR Images

Learning SAR-Optical Cross Modal Features for Land Cover Classification

MFFnet: Multimodal Feature Fusion Network for Synthetic Aperture Radar and Optical Image Land Cover Classification

OPT-SAR-MS2Net: A Multi-Source Multi-Scale Siamese Network for Land Object Classification Using Remote Sensing Images

Multimodal Semantic Consistency-Based Fusion Architecture Search for Land Cover Classification

Multimodal Bilinear Fusion Network with Second-Order Attention-Based Channel Selection for Land Cover Classification

Multi-Modal Fusion Architecture Search for Land Cover Classification Using Heterogeneous Remote Sensing Images

Self-supervised SAR-optical Data Fusion and Land-cover Mapping using Sentinel-1/-2 Images

Optical and SAR Image Fusion Based on Complementary Feature Decomposition and Visual Saliency Features

CFNet: A Cross Fusion Network for Joint Land Cover Classification Using Optical and SAR Images

EMMCNN: An ETPS-Based Multi-Scale and Multi-Feature Method Using CNN for High Spatial Resolution Image Land-Cover Classification

An Effective Multi-model Fusion Method for SAR and Optical Remote Sensing Images

Semantic Representation Fusion-Based Network for Robust Land Cover Classification in Foggy Conditions

CMMSNet: A Multi-modal Semantic Segmentation Network for Rooftop Extraction Based on SAR and Optical Images

Collaborative Attention-Based Heterogeneous Gated Fusion Network for Land Cover Classification

CMSE: Cross-Modal Semantic Enhancement Network for Classification of Hyperspectral and LiDAR Data

Land cover classification algorithm based on multi-modal collaboration and boundary-guided fusion

Optical and SAR Image Fusion Based on Visual Saliency Features

Integration of Convolutional Neural Networks and Object-Based Post-Classification Refinement for Land Use and Land Cover Mapping with Optical and SAR Data

SAR Image Classification Using Fully Connected Conditional Random Fields Combined with Deep Learning and Superpixel Boundary Constraint

SFMRNet: Specific Feature Fusion and Multibranch Feature Refinement Network for Land Use Classification