Abstract: Synthetic Aperture Radar (SAR) images have proven to be a valuable cue for multimodal Land Cover Classification (LCC) when combined with RGB images. Most existing studies on cross-modal fusion assume that consistent feature information is necessary between the two modalities, and as a result, they construct networks without adequately addressing the unique characteristics of each modality. In this paper, we propose a novel architecture, named the Asymmetric Semantic Aligning Network (ASANet), which introduces asymmetry at the feature level to address the issue that multimodal architectures frequently fail to fully utilize complementary features. The core of this network is the Semantic Focusing Module (SFM), which explicitly calculates differential weights for each modality to account for the modality-specific features. Furthermore, ASANet incorporates a Cascade Fusion Module (CFM), which delves deeper into channel and spatial representations to efficiently select features from the two modalities for fusion. Through the collaborative effort of these two modules, the proposed ASANet effectively learns feature correlations between the two modalities and eliminates noise caused by feature differences. Comprehensive experiments demonstrate that ASANet achieves excellent performance on three multimodal datasets. Additionally, we have established a new RGB-SAR multimodal dataset, on which our ASANet outperforms other mainstream methods with improvements ranging from 1.21% to 17.69%. ASANet runs at 48.7 frames per second (FPS) when the input image is 256x256 pixels. The source code is available at https://github.com/whu-pzhang/ASANet

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses how to effectively fuse RGB images and Synthetic Aperture Radar (SAR) images in multimodal Land Cover Classification (LCC). Specifically, it focuses on the following points:

1. **Complementarity of multimodal features**:
   - Existing multimodal fusion methods usually assume that consistent feature information is required between the two modalities, so they build networks without fully considering the unique characteristics of each modality. Such symmetric fusion may ignore the complementary features of the different modalities and thus degrade classification performance.
2. **Feature interaction and noise suppression**:
   - Traditional methods either lack direct feature interaction during feature extraction or aggregation, or rely on simple feature concatenation and addition, which may introduce noise and degrade the final classification result. The paper proposes the Asymmetric Semantic Aligning Network (ASANet), which introduces asymmetry at the feature level to better utilize complementary features and reduce noise interference.
3. **Insufficient datasets**:
   - Publicly available RGB-SAR multimodal datasets are few, and their quality varies. To fill this gap, the paper constructs a new RGB-SAR multimodal dataset, named PIE-RGB-SAR, which contains finely labeled RGB and SAR images from the Pearl River Delta region in China.

### Main contributions

1. **Proposing ASANet**:
   - ASANet calibrates and aligns the information of the two modalities through the Semantic Focusing Module (SFM) and the Cascade Fusion Module (CFM), making full use of the complementarity and consistency of features.
2. **Constructing a new multimodal dataset**:
   - The PIE-RGB-SAR dataset contains finely labeled RGB and SAR images divided into six land cover classes: urban, road, water, forest, farmland, and others. The dataset includes cloud-covered areas for evaluating the effectiveness of RGB-SAR fusion under unfavorable observation conditions.
3. **Excellent performance**:
   - ASANet achieves state-of-the-art (SOTA) performance on three multimodal datasets. In particular, on the PIE-RGB-SAR dataset, ASANet increases the mean Intersection over Union (mIoU) by 1.21% to 17.69% compared with other methods.

### Method overview

1. **Overall network architecture**:
   - ASANet adopts a dual-branch encoder and a decoder. In the encoder, the two branches extract features from the RGB and SAR images respectively and capture the unique characteristics of each modality. The decoder adopts the UPerNet design, which adapts flexibly to different segmentation tasks.
2. **Semantic Focusing Module (SFM)**:
   - SFM processes multimodal features through a channel attention mechanism and calculates an attention matrix for each branch, focusing on unique and complementary features. Specific steps include:
     - Obtaining difference features: Obtaining the difference feature maps of the RGB and SAR images through pixel-level subtraction.
     - Adaptive fine-grained learning: Performing fine-grained learning through global max pooling and multiple convolution modules to generate global difference weights.
     - Correcting parallel branches: Correcting the original feature maps using the sigmoid function and pixel-level multiplication.
3. **Cascade Fusion Module (CFM)**:
   - CFM learns deep representations of the RGB and SAR features in the channel and spatial dimensions through a cascade attention mechanism, adaptively calibrates and aligns complementary information, and reduces noise interference. Specific steps include:
     - Alignment and calibration in the channel dimension: Generating channel attention weights through global average pooling and a multi-layer perceptron (MLP).
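
The SFM steps and the CFM channel step summarized above can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' PyTorch implementation: the function names (`bottleneck_gate`, `semantic_focusing`, `channel_calibration`), the two-layer bottleneck standing in for the "multiple convolution modules", the reduction size `R`, the use of the same difference map for both branches, and the concatenation of pooled RGB and SAR descriptors before the MLP are all assumptions made for the example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def bottleneck_gate(vec, w_down, w_up):
    """Linear -> ReLU -> linear -> sigmoid; returns weights in (0, 1).

    On a pooled (C,) vector, a 1x1 convolution is just a matrix product,
    so this stands in for both the conv stack and the MLP (assumption).
    """
    hidden = np.maximum(w_down @ vec, 0.0)
    return sigmoid(w_up @ hidden)


def semantic_focusing(f_rgb, f_sar, w_rgb, w_sar):
    """SFM-style sketch for feature maps of shape (C, H, W)."""
    diff = f_rgb - f_sar                      # pixel-level subtraction
    pooled = diff.max(axis=(1, 2))            # global max pooling -> (C,)
    # Each branch has its own parameters, so the two sets of weights differ.
    a_rgb = bottleneck_gate(pooled, *w_rgb)
    a_sar = bottleneck_gate(pooled, *w_sar)
    # Correct each branch by pixel-level multiplication with its weights.
    return f_rgb * a_rgb[:, None, None], f_sar * a_sar[:, None, None]


def channel_calibration(f_rgb, f_sar, w_down, w_up):
    """CFM channel step sketch: global average pooling followed by an MLP."""
    c = f_rgb.shape[0]
    pooled = np.concatenate([f_rgb.mean(axis=(1, 2)),
                             f_sar.mean(axis=(1, 2))])   # (2C,)
    weights = bottleneck_gate(pooled, w_down, w_up)       # (2C,)
    return f_rgb * weights[:c, None, None], f_sar * weights[c:, None, None]


# Toy usage with random features and random (untrained) parameters.
rng = np.random.default_rng(0)
C, H, W, R = 8, 16, 16, 2
f_rgb = rng.standard_normal((C, H, W))
f_sar = rng.standard_normal((C, H, W))
w_rgb = (rng.standard_normal((R, C)), rng.standard_normal((C, R)))
w_sar = (rng.standard_normal((R, C)), rng.standard_normal((C, R)))

g_rgb, g_sar = semantic_focusing(f_rgb, f_sar, w_rgb, w_sar)
out_rgb, out_sar = channel_calibration(
    g_rgb, g_sar,
    rng.standard_normal((R, 2 * C)), rng.standard_normal((2 * C, R)),
)
print(out_rgb.shape, out_sar.shape)
```

Because every weight passes through a sigmoid, each step can only attenuate a channel, never amplify it; the spatial-dimension step of CFM is not covered here.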

ASANet: Asymmetric Semantic Aligning Network for RGB and SAR image land cover classification
