Abstract: Distributed rooftop photovoltaic systems hold immense potential for renewable energy generation, and accurate extraction of building roofs from high-resolution remote sensing imagery is crucial for their development. While current semantic segmentation methods primarily rely on single-modality optical images, Synthetic Aperture Radar (SAR) offers complementary ground information that can significantly enhance segmentation accuracy. However, the modality disparities arising from different imaging mechanisms make feature fusion between SAR and optical images challenging. Existing approaches rely on simplistic fusion methods to exploit the complementary information of each modality and ignore the correlative information between modalities during feature extraction, which results in insufficient integration of complementary information. To address these challenges, we introduce CMMSNet, a novel multi-modal fusion semantic segmentation network specifically designed for building roof extraction. The architecture of CMMSNet consists of three core modules: the feature extraction encoder module, the heterogeneous modality alignment module, and the modality fusion decoder module. First, CMMSNet employs two independent pyramid-structured encoders to extract feature pyramids from the SAR and optical images separately at multiple scales. This strategy is intended to capture multi-scale semantic context and to address the large variation in spatial scale among objects in remote sensing images. Furthermore, an Adaptive Feature Alignment Module (AFAM) identifies correlative information between the two modalities along the spatial dimension and aligns the modal features accordingly. This process facilitates cross-modal learning and enhances the feature representation of each modality.
In addition, a Cross-Modal Multi-Scale Feature Fusion (CMMSFF) module is designed to effectively integrate the multi-scale, multi-modal heterogeneous features from both modalities. This module employs a channel self-attention mechanism that adaptively fuses discriminative features by weighting each modality and selectively discarding irrelevant components, thus enhancing the selection and fusion of key channels within the multi-modal feature set. This approach allows us to harness the complementary information provided by SAR and optical images, enhancing overall segmentation performance. Comprehensive experiments on the DFC23 dataset demonstrate that the proposed CMMSNet outperforms existing mainstream semantic segmentation methods, both single-modal and multi-modal, in stability and effectiveness. This achievement sets a new benchmark for building rooftop extraction from multi-modal remote sensing images. Our findings highlight the importance of multi-modal data fusion in addressing real-world challenges in remote sensing image analysis and offer valuable insights for future research in this domain.
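
The abstract does not specify how the channel attention in CMMSFF is computed, so the following is only a minimal sketch of the general idea of channel-wise weighted fusion of two modalities, not the authors' module. It assumes a squeeze-and-excitation style design: each modality's feature map is reduced to a per-channel descriptor by global average pooling, a projection turns the descriptors into scores, and a softmax across the two modalities yields per-channel fusion weights. The function name `channel_attention_fuse`, the projection matrices `w_opt` and `w_sar`, and the tensor shapes are all hypothetical; in a trained network the projections would be learned rather than random.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def channel_attention_fuse(f_opt, f_sar, w_opt, w_sar):
    """Fuse two (C, H, W) feature maps with per-channel modality weights.

    w_opt and w_sar are hypothetical (C, C) projections that stand in for
    learned parameters; they are assumptions of this sketch.
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    d_opt = f_opt.mean(axis=(1, 2))  # shape (C,)
    d_sar = f_sar.mean(axis=(1, 2))  # shape (C,)

    # Score each channel of each modality, then normalise across the two
    # modalities so the weights for a given channel sum to 1.
    scores = np.stack([w_opt @ d_opt, w_sar @ d_sar])  # shape (2, C)
    weights = softmax(scores, axis=0)                  # shape (2, C)

    # Broadcast the per-channel weights over the spatial dimensions.
    fused = (weights[0][:, None, None] * f_opt
             + weights[1][:, None, None] * f_sar)
    return fused, weights


# Toy inputs: random features stand in for encoder outputs at one scale.
rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
f_opt = rng.standard_normal((C, H, W))
f_sar = rng.standard_normal((C, H, W))
w_opt = rng.standard_normal((C, C))
w_sar = rng.standard_normal((C, C))

fused, weights = channel_attention_fuse(f_opt, f_sar, w_opt, w_sar)
```

A channel whose weight for one modality is close to zero is effectively discarded from that modality, which is one simple way to realise the "selectively discarding irrelevant components" behaviour described above. A multi-scale variant would apply the same fusion independently at each level of the two feature pyramids.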

CMSE: Cross-Modal Semantic Enhancement Network for Classification of Hyperspectral and LiDAR Data

Cross Attention-Based Multi-Scale Convolutional Fusion Network for Hyperspectral and LiDAR Joint Classification

A Crossmodal Multiscale Fusion Network for Semantic Segmentation of Remote Sensing Data

CIMFNet: Cross-layer Interaction and Multiscale Fusion Network for Semantic Segmentation of High-Resolution Remote Sensing Images

Crossmodal Sequential Interaction Network for Hyperspectral and LiDAR Data Joint Classification

An Efficient Cross-Modality Self-Calibrated Network for Hyperspectral and Multispectral Image Fusion

Dual-Branch Feature Fusion Network Based Cross-Modal Enhanced CNN and Transformer for Hyperspectral and LiDAR Classification

Multifrequency Graph Convolutional Network With Cross-Modality Mutual Enhancement for Multisource Remote Sensing Data Classification

CMR-net: A cross modality reconstruction network for multi-modality remote sensing classification

CMMSNet: A Multi-modal Semantic Segmentation Network for Rooftop Extraction Based on SAR and Optical Images

Deep Multimodal Fusion Network for Semantic Segmentation Using Remote Sensing Image and LiDAR Data

MS2CANet: Multiscale Spatial–Spectral Cross-Modal Attention Network for Hyperspectral Image and LiDAR Classification

Multimodal Semantic Consistency-Based Fusion Architecture Search for Land Cover Classification

LMFNet: An Efficient Multimodal Fusion Approach for Semantic Segmentation in High-Resolution Remote Sensing

Multilevel Attention Dynamic-Scale Network for HSI and LiDAR Data Fusion Classification

Semantic Guidance Fusion Network for Cross-Modal Semantic Segmentation

Multimodal Semantic Collaborative Classification for Hyperspectral Images and LiDAR Data

MultiSenseSeg: A Cost-Effective Unified Multimodal Semantic Segmentation Model for Remote Sensing

Multimodal Hyperspectral Image Classification via Interconnected Fusion

Joint Classification of Hyperspectral and LiDAR Data Using Height Information Guided Hierarchical Fusion-and-Separation Network