Abstract: Cross-modality fusion of the complementary information in multispectral remote sensing image pairs can improve the perception ability of detection algorithms, making them more robust and reliable for a wider range of applications, such as nighttime detection. In contrast to prior methods, we argue that different features should be processed differently: modality-specific features should be retained and enhanced, while modality-shared features should be cherry-picked from the RGB and thermal IR modalities. Following this idea, a novel and lightweight multispectral feature fusion approach with joint common-modality and differential-modality attention, named Cross-Modality Attentive Feature Fusion (CMAFF), is proposed. Given the intermediate feature maps of RGB and IR images, our module infers attention maps in parallel from two separate modalities, common-modality and differential-modality; each attention map is then multiplied with the input feature map for adaptive feature enhancement or selection. Extensive experiments demonstrate that the proposed approach achieves state-of-the-art performance at a low computational cost.
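The mechanism described in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the decomposition into a common part `(rgb + ir) / 2` and a differential part `(rgb - ir) / 2`, the parameter-free channel attention (global average pooling followed by a sigmoid), and the names `cmaff_sketch`, `channel_attention`, `combine`, and `scale_channels` are all assumptions made for illustration. A real module would use learned layers on tensors rather than nested lists.

```python
import math

def sigmoid(x):
    """Logistic function, used here as the attention gate."""
    return 1.0 / (1.0 + math.exp(-x))

def combine(a, b, fn):
    """Apply fn element-wise to two [C][H][W] feature maps (nested lists)."""
    return [[[fn(x, y) for x, y in zip(row_a, row_b)]
             for row_a, row_b in zip(ch_a, ch_b)]
            for ch_a, ch_b in zip(a, b)]

def channel_attention(feat):
    """One weight in (0, 1) per channel: global average pooling, then sigmoid.
    ASSUMPTION: a parameter-free stand-in for a learned attention layer."""
    return [sigmoid(sum(map(sum, ch)) / (len(ch) * len(ch[0]))) for ch in feat]

def scale_channels(feat, weights):
    """Multiply every value in channel c by weights[c]."""
    return [[[w * x for x in row] for row in ch] for ch, w in zip(feat, weights)]

def cmaff_sketch(rgb, ir):
    """Hypothetical illustration of common/differential attention."""
    # ASSUMPTION: split the pair into a shared part and a modality-specific part.
    common = combine(rgb, ir, lambda x, y: (x + y) / 2.0)
    diff = combine(rgb, ir, lambda x, y: (x - y) / 2.0)

    # The two attention maps are inferred independently, so they could be
    # computed in parallel.
    att_common = channel_attention(common)
    att_diff = channel_attention(diff)

    # Enhancement: residual re-weighting of each modality's own features.
    # Channels where RGB dominates boost RGB; the complement boosts IR.
    add = lambda x, y: x + y
    rgb_enh = combine(rgb, scale_channels(rgb, att_diff), add)
    ir_enh = combine(ir, scale_channels(ir, [1.0 - w for w in att_diff]), add)

    # Selection: gate the shared features channel by channel.
    shared = scale_channels(common, att_common)
    return rgb_enh, ir_enh, shared

# Toy input: 2 channels, 2x2 spatial size, identical in both modalities.
rgb = [[[1.0, 1.0], [1.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]]
ir = [[[1.0, 1.0], [1.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]]
rgb_enh, ir_enh, shared = cmaff_sketch(rgb, ir)
```

When the two inputs are identical, as in the toy example, the differential part is zero and both modalities receive the same neutral gate, while the common-modality gate alone decides how much of the shared signal is passed on.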

What problem does this paper attempt to address?

The paper addresses how to fuse information from different modalities effectively so as to improve the perception ability and robustness of object detection in multispectral remote sensing images, especially under difficult conditions such as nighttime detection. Traditional multispectral fusion methods (simple concatenation, element-wise addition, and element-wise multiplication) cannot fully exploit the complementary information between modalities and may even make the network harder to train, degrading performance. The authors therefore propose a new lightweight multispectral feature fusion method, Cross-Modality Attentive Feature Fusion (CMAFF), which aims to exploit the complementary information in multispectral images more effectively through joint common-modality and differential-modality attention mechanisms.

Specifically, the main contributions of the paper are:

1. **Proposing the CMAFF module**: The module fuses the complementary information in multispectral remote sensing images, enhancing features through differential-modality attention and selecting features through common-modality attention.
2. **Designing the two-stream object detection network YOLOFusion**: Built on the lightweight YOLOv5-small detector, it is used to verify the performance of the CMAFF module on multispectral remote sensing images.
3. **Extensive ablation experiments**: The experiments verify the effectiveness of the CMAFF module and analyze how different arrangements of the attention modules affect performance.

The core idea of the paper is a divide-and-conquer treatment of the features from the two modalities: modality-specific features are retained and enhanced, while shared features are selected from both modalities. This makes more effective use of the information in multispectral images and improves the accuracy and robustness of object detection.
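For reference, the three traditional fusion operators named in the summary can be written in a few lines. This is a minimal sketch on nested-list feature maps of shape `[C][H][W]`; the function names `fuse_concat`, `fuse_add`, and `fuse_mul` are hypothetical and do not come from the paper.

```python
def _elementwise(a, b, fn):
    """Apply fn to matching entries of two [C][H][W] nested lists."""
    return [[[fn(x, y) for x, y in zip(row_a, row_b)]
             for row_a, row_b in zip(ch_a, ch_b)]
            for ch_a, ch_b in zip(a, b)]

def fuse_concat(rgb, ir):
    """Concatenation: stack the channels, [C][H][W] + [C][H][W] -> [2C][H][W]."""
    return list(rgb) + list(ir)

def fuse_add(rgb, ir):
    """Element-wise addition: same shape as the inputs."""
    return _elementwise(rgb, ir, lambda x, y: x + y)

def fuse_mul(rgb, ir):
    """Element-wise multiplication: same shape as the inputs."""
    return _elementwise(rgb, ir, lambda x, y: x * y)

# Toy input: one channel, one row, two positions.
# The second position responds in RGB only.
rgb = [[[1.0, 2.0]]]
ir = [[[3.0, 0.0]]]

concat_out = fuse_concat(rgb, ir)
add_out = fuse_add(rgb, ir)
mul_out = fuse_mul(rgb, ir)
```

In the toy input, multiplication drives the second position to zero because only one modality responds there, whereas addition and concatenation treat every channel and position uniformly. None of the three operators adapts to which modality is more informative, which is one way to read the summary's point about their limits.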

Cross-Modality Attentive Feature Fusion for Object Detection in Multispectral Remote Sensing Imagery
