Abstract: Distributed rooftop photovoltaic systems hold immense potential for renewable energy generation, and accurate extraction of building roofs from high-resolution remote sensing imagery is crucial for their development. While current semantic segmentation methods primarily rely on single-modality optical images, Synthetic Aperture Radar (SAR) offers complementary ground information that can significantly enhance segmentation accuracy. However, the modality disparities arising from different imaging mechanisms make feature fusion between SAR and optical images challenging. Existing approaches rely on simplistic fusion methods to exploit the complementary information of each modality and ignore the correlative information between modalities during feature extraction, which results in insufficient integration of complementary information. To address these challenges, we introduce CMMSNet, a novel multi-modal fusion semantic segmentation network specifically designed for building roof extraction. CMMSNet consists of three core modules: the feature extraction encoder module, the heterogeneous modality alignment module, and the modality fusion decoder module. First, CMMSNet employs two independent pyramid-structured encoders to extract feature pyramids from SAR and optical images separately at multiple scales. This strategy is intended to capture multi-scale semantic context and to address the large variations in spatial scale among objects in remote sensing images. Furthermore, an Adaptive Feature Alignment Module (AFAM) identifies correlative information between the two modalities along the spatial dimension and aligns the modal features accordingly. This process is crucial for facilitating cross-modal learning and enhancing the feature representation of each modality. 
In addition, a Cross-Modal Multi-Scale Feature Fusion (CMMSFF) module is designed to effectively integrate multi-scale, heterogeneous features from both modalities. This module employs a channel self-attention mechanism that adaptively fuses discriminative features by applying weights to each modality and selectively discarding irrelevant components, thus enhancing the selection and fusion of key channels within the multi-modal feature set. This approach allows us to harness the complementary information provided by SAR and optical images, enhancing the overall segmentation performance. Comprehensive experiments on the DFC23 dataset demonstrate that CMMSNet outperforms existing mainstream semantic segmentation methods, both single-modal and multi-modal, in stability and effectiveness. This result sets a new benchmark for the extraction of building rooftops from multi-modal remote sensing images. Our findings highlight the importance of multi-modal data fusion in addressing real-world challenges in remote sensing image analysis and offer valuable insights for future research in this domain.
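
The abstract describes the channel-attention fusion only at a high level, so the following is a minimal illustrative sketch of the general idea (per-channel weights applied to each modality before summation), not the CMMSFF module itself. It assumes the optical and SAR feature maps are already aligned and share the same C x H x W shape, uses plain spatial means as channel descriptors, and uses a softmax over the two modalities as the weighting. All function names are hypothetical, and a trained network would place learnable layers between pooling and weighting.

```python
import math


def global_average_pool(feature):
    """Reduce a C x H x W feature map (nested lists) to a list of C channel means."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature
    ]


def softmax_pair(a, b):
    """Numerically stable softmax over two scalar scores."""
    m = max(a, b)
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb), eb / (ea + eb)


def fuse_by_channel_attention(optical, sar):
    """Fuse two aligned C x H x W feature maps with per-channel modality weights.

    Assumption (not taken from the paper): the channel descriptor is the
    spatial mean, and the two modality weights for each channel come from a
    softmax over those two descriptors.
    """
    opt_desc = global_average_pool(optical)
    sar_desc = global_average_pool(sar)
    fused, weights = [], []
    for c, (opt_ch, sar_ch) in enumerate(zip(optical, sar)):
        w_opt, w_sar = softmax_pair(opt_desc[c], sar_desc[c])
        weights.append((w_opt, w_sar))
        # Weighted sum of the two modalities, applied to every pixel of channel c.
        fused.append([
            [w_opt * o + w_sar * s for o, s in zip(opt_row, sar_row)]
            for opt_row, sar_row in zip(opt_ch, sar_ch)
        ])
    return fused, weights
```

In this sketch, a channel whose response is stronger in one modality receives a larger weight for that modality, while the weaker modality is down-weighted rather than removed outright. The weights for each channel always sum to one, so the fused map stays on the same scale as its inputs.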

Building Footprint Extraction of Coastal Cities from Multi-source Aerial Images Based on Semantic Segmentation

An Image Segmentation Method Based on Transformer and Multi-Scale Feature Fusion for UAV Marine Environment Monitoring

Semantic Segmentation-Based Building Footprint Extraction Using Very High-Resolution Satellite Images and Multi-Source GIS Data

MSFTrans: a multi-task frequency-spatial learning transformer for building extraction from high spatial resolution remote sensing images

Transformer-based semantic segmentation for large-scale building footprint extraction from very-high resolution satellite images

A Lightweight Building Extraction Approach for Contour Recovery in Complex Urban Environments

A Transformer-based Multi-Modal Fusion Network for Semantic Segmentation of High-Resolution Remote Sensing Imagery

FusionHeightNet: A Multi-Level Cross-Fusion Method from Multi-Source Remote Sensing Images for Urban Building Height Estimation

UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery

Multi-scale attention integrated hierarchical networks for high-resolution building footprint extraction

Transformer Meets Convolution: A Bilateral Awareness Network for Semantic Segmentation of Very Fine Resolution Urban Scene Images

CMMSNet: A Multi-modal Semantic Segmentation Network for Rooftop Extraction Based on SAR and Optical Images

Simultaneous Extraction of Spatial and Attributional Building Information Across Large-Scale Urban Landscapes from High-Resolution Satellite Imagery

Extracting Building Footprint From Remote Sensing Images by an Enhanced Vision Transformer Network

STransU2Net: Transformer based hybrid model for building segmentation in detailed satellite imagery

Hybrid Attention Fusion Embedded in Transformer for Remote Sensing Image Semantic Segmentation

Architecture of Deep Convolutional Encoder-Decoder Networks for Building Footprint Semantic Segmentation

Extracting Buildings from Remote Sensing Images Using a Multitask Encoder-Decoder Network with Boundary Refinement

Asymmetric Network Combining CNN and Transformer for Building Extraction from Remote Sensing Images

A Semantic Segmentation Network for Urban-Scale Building Footprint Extraction Using RGB Satellite Imagery