Abstract:Distributed rooftop photovoltaic systems hold immense potential for renewable energy generation, and accurate extraction of building roofs from high-resolution remote sensing imagery is crucial for their development. While current semantic segmentation methods primarily rely on single-modality optical images, Synthetic Aperture Radar (SAR) offers complementary ground information that can significantly enhance segmentation accuracy. However, the modality disparities arising from different imaging mechanisms pose challenges in feature fusion between SAR and optical images, existing approaches rely on simplistic fusion methods to exploit the complementary information of each modality, ignoring the correlative information between the different modalities during feature extraction, this results in an insufficient integration of complementary information.To address these challenges, we introduce CMMSNet, a novel multi-modal fusion semantic segmentation network specifically designed for building roof extraction. The main architecture of CMMSNet is constituted by three three core modules: the feature extraction encoder module, the heterogeneous modality alignment module, and the modality fusion decoder module. Initially, dual independent pyramid-structured encoders are employed by CMMSNet to separately extract feature pyramids from SAR and optical images at various scales, this strategy is intended to capture multi-scale semantic contexts and address the issue of large spatial scale variations among different objects in remote sensing images. Furthermore, an Adaptive Feature Alignment Module (AFAM) is introduced, tasked with identifying correlative information between the two modalities from a spatial dimension and aligning the modal features accordingly, this process is crucial for facilitating cross-modal learning and in enhancing the feature representation of each modality. In addition, a Cross-Modal Multi-Scale Feature Fusion (CMMSFF) module is designed to effectively integrates multi-scale and multi-modal heterogeneous features from both modalities, this module employs a channel self-attention mechanism, which adaptively fuses discriminative features by applying weights to each modality and selectively discarding irrelevant components, thus enhancing the selection and fusion of key channels within the multimodal features set. This innovative approach allows us to harness the complementary information provided by SAR and optical images, enhancing the overall segmentation performance. A series of comprehensive experiments conducted on the DFC23 dataset demonstrate that our proposed CMMSNet outperforms other existing mainstream semantic segmentation methods in both stability and effectiveness, including both single-modal and multi-modal approaches. This achievement sets a new benchmark for the extraction of building rooftop through the use of multi-modal remote sensing images. Our findings highlight the importance of leveraging multi-modal data fusion in addressing real-world challenges in remote sensing image analysis and offer valuable insights for future research in this domain.

CMMSNet:A Multi-modal Semantic Segmentation Network for Rooftop Extraction Based on SAR and Optical Images

A Transformer-based Multi-Modal Fusion Network for Semantic Segmentation of High-Resolution Remote Sensing Imagery

Enhanced Multi-Scale Feature Adaptive Fusion Sparse Convolutional Network for Large-Scale Scenes Semantic Segmentation

A Crossmodal Multiscale Fusion Network for Semantic Segmentation of Remote Sensing Data

CMSE: Cross-Modal Semantic Enhancement Network for Classification of Hyperspectral and LiDAR Data

Multi-View Feature Fusion and Rich Information Refinement Network for Semantic Segmentation of Remote Sensing Images

OPT-SAR-MS2Net: A Multi-Source Multi-Scale Siamese Network for Land Object Classification Using Remote Sensing Images

ACMFNet: Attention-Based Cross-Modal Fusion Network for Building Extraction of Remote Sensing Images

MCAFNet: A Multiscale Channel Attention Fusion Network for Semantic Segmentation of Remote Sensing Images

Deep Multimodal Fusion Network for Semantic Segmentation Using Remote Sensing Image and LiDAR Data

Local–Global Multiscale Fusion Network for Semantic Segmentation of Buildings in SAR Imagery

Efficient Multi-scale Network for Semantic Segmentation of fine-Resolution Remotely Sensed Images

CIMFNet: Cross-layer Interaction and Multiscale Fusion Network for Semantic Segmentation of High-Resolution Remote Sensing Images

HCRB-MSAN: Horizontally Connected Residual Blocks-Based Multiscale Attention Network for Semantic Segmentation of Buildings in HSR Remote Sensing Images

Densely Based Multi-Scale and Multi-Modal Fully Convolutional Networks for High-Resolution Remote-Sensing Image Semantic Segmentation

Progressive fusion learning: A multimodal joint segmentation framework for building extraction from optical and SAR images

An Attention-Fused Network for Semantic Segmentation of Very-High-Resolution Remote Sensing Imagery

Multi-Scale Feature Fusion Attention Network for Building Extraction in Remote Sensing Images

Rooftop PV Segmenter: A Size-Aware Network for Segmenting Rooftop Photovoltaic Systems from High-Resolution Imagery

Scale-Aware Neural Network for Semantic Segmentation of Multi-Resolution Remote Sensing Images

BOMSC-Net: Boundary Optimization and Multi-Scale Context Awareness Based Building Extraction From High-Resolution Remote Sensing Imagery