CMMSNet: A Multi-modal Semantic Segmentation Network for Rooftop Extraction Based on SAR and Optical Images

Zhengwei Shen, Yongheng Shang, Xiaoyu Zhang, Jianwei Yin, Jun Han, Chao Cai
DOI: https://doi.org/10.1109/agro-geoinformatics262780.2024.10660703
2024-01-01
Abstract: Distributed rooftop photovoltaic systems hold immense potential for renewable energy generation, and accurate extraction of building roofs from high-resolution remote sensing imagery is crucial for their development. While current semantic segmentation methods primarily rely on single-modality optical images, Synthetic Aperture Radar (SAR) offers complementary ground information that can significantly enhance segmentation accuracy. However, the modality disparities arising from different imaging mechanisms pose challenges for feature fusion between SAR and optical images. Existing approaches rely on simplistic fusion methods to exploit the complementary information of each modality and ignore the correlative information between the modalities during feature extraction, which results in insufficient integration of complementary information. To address these challenges, we introduce CMMSNet, a novel multi-modal fusion semantic segmentation network specifically designed for building roof extraction. The main architecture of CMMSNet consists of three core modules: the feature extraction encoder module, the heterogeneous modality alignment module, and the modality fusion decoder module. First, CMMSNet employs dual independent pyramid-structured encoders to separately extract feature pyramids from SAR and optical images at various scales. This strategy is intended to capture multi-scale semantic contexts and address the issue of large spatial scale variations among different objects in remote sensing images. Furthermore, an Adaptive Feature Alignment Module (AFAM) is introduced to identify correlative information between the two modalities along the spatial dimension and align the modal features accordingly. This process is crucial for facilitating cross-modal learning and enhancing the feature representation of each modality.
In addition, a Cross-Modal Multi-Scale Feature Fusion (CMMSFF) module is designed to effectively integrate multi-scale, multi-modal heterogeneous features from both modalities. This module employs a channel self-attention mechanism, which adaptively fuses discriminative features by applying weights to each modality and selectively discarding irrelevant components, thus enhancing the selection and fusion of key channels within the multi-modal feature set. This innovative approach allows us to harness the complementary information provided by SAR and optical images, enhancing the overall segmentation performance. A series of comprehensive experiments conducted on the DFC23 dataset demonstrates that the proposed CMMSNet outperforms existing mainstream semantic segmentation methods, including both single-modal and multi-modal approaches, in both stability and effectiveness. This achievement sets a new benchmark for the extraction of building rooftops using multi-modal remote sensing images. Our findings highlight the importance of leveraging multi-modal data fusion in addressing real-world challenges in remote sensing image analysis and offer valuable insights for future research in this domain.
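The idea of fusing two modalities with per-channel weights can be illustrated with a minimal sketch. This is not the CMMSFF module itself, whose internals the abstract does not specify. It assumes a much simpler stand-in: each channel is summarized by global average pooling, and a softmax over the two pooled values gives the optical and SAR weights for that channel. The function names `global_average_pool` and `channel_weighted_fusion` are hypothetical, and a real implementation would learn the weights rather than compute them directly from the pooled values.

```python
import math


def global_average_pool(feature):
    """Reduce a feature map of shape [C][H][W] to one mean value per channel."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature
    ]


def channel_weighted_fusion(optical, sar):
    """Fuse two [C][H][W] feature maps using per-channel softmax weights.

    Assumption: weights are a softmax over pooled channel descriptors.
    This is an illustrative stand-in, not the learned channel
    self-attention described for CMMSFF.
    """
    opt_desc = global_average_pool(optical)
    sar_desc = global_average_pool(sar)
    fused, weights = [], []
    for c, (o, s) in enumerate(zip(opt_desc, sar_desc)):
        m = max(o, s)  # subtract the max for numerical stability
        e_o, e_s = math.exp(o - m), math.exp(s - m)
        w_o = e_o / (e_o + e_s)
        w_s = 1.0 - w_o
        weights.append((w_o, w_s))
        # Weighted sum of the two modalities, element by element.
        fused.append([
            [w_o * a + w_s * b for a, b in zip(row_o, row_s)]
            for row_o, row_s in zip(optical[c], sar[c])
        ])
    return fused, weights


# Toy 2-channel, 2x2 feature maps (made-up values for illustration only).
optical = [[[1.0, 1.0], [1.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]]
sar = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
fused, weights = channel_weighted_fusion(optical, sar)
```

In this toy input the first channel has equal responses in both modalities, so it is weighted evenly, while the second channel leans toward SAR because its pooled response is larger there. A learned attention module would replace the fixed softmax with trainable layers, but the weighted-sum structure would be similar under these assumptions.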