Abstract: Multispectral image pairs can provide the combined information, making object detection applications more reliable and robust in the open world. To fully exploit the different modalities, we present a simple yet effective cross-modality feature fusion approach, named Cross-Modality Fusion Transformer (CFT) in this paper. Unlike prior CNNs-based works, guided by the transformer scheme, our network learns long-range dependencies and integrates global contextual information in the feature extraction stage. More importantly, by leveraging the self attention of the transformer, the network can naturally carry out simultaneous intra-modality and inter-modality fusion, and robustly capture the latent interactions between RGB and Thermal domains, thereby significantly improving the performance of multispectral object detection. Extensive experiments and ablation studies on multiple datasets demonstrate that our approach is effective and achieves state-of-the-art detection performance. Our code and models are available at https://github.com/DocF/multispectral-object-detection.
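
The claim that self-attention performs intra-modality and inter-modality fusion simultaneously can be made concrete with a short derivation. The notation here is an assumption for illustration and is not taken from the abstract: $X_R, X_T \in \mathbb{R}^{N \times C}$ are the flattened RGB and thermal token sequences, $W^Q, W^K, W^V$ are shared projection matrices, and $d_k$ is the key dimension.

```latex
% Concatenate the two modalities along the token axis.
X = \begin{bmatrix} X_R \\ X_T \end{bmatrix} \in \mathbb{R}^{2N \times C},
\qquad Q = X W^Q, \quad K = X W^K, \quad V = X W^V .

% Writing Q_R = X_R W^Q, K_T = X_T W^K, and so on, the score matrix splits into four blocks.
Q K^\top =
\begin{bmatrix}
  Q_R K_R^\top & Q_R K_T^\top \\
  Q_T K_R^\top & Q_T K_T^\top
\end{bmatrix},
\qquad
Z = \operatorname{softmax}\!\left( \frac{Q K^\top}{\sqrt{d_k}} \right) V .
```

The diagonal blocks score tokens against tokens of the same modality (intra-modality), while the off-diagonal blocks score RGB tokens against thermal tokens and vice versa (inter-modality). Because the softmax is taken over each full row, both kinds of interaction compete within a single normalization, so one attention pass covers both.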

What problem does this paper attempt to address?

This paper asks how to fuse information from different modalities effectively in multispectral object detection, so as to improve the perception ability, reliability and robustness of the detector. Specifically, it focuses on designing a cross-modality fusion mechanism that makes full use of the complementarity between modalities, enabling more reliable and robust object detection in the open world. Existing methods are mainly based on convolutional neural networks (CNNs), which are limited in cross-modality fusion and modality interaction, especially in exploiting the inherent complementarity between modalities. The paper therefore proposes a Transformer-based cross-modality fusion method, the Cross-Modality Fusion Transformer (CFT), which aims to overcome these shortcomings and significantly improve multispectral object detection performance.

### Main contributions of the paper

1. **Introduction of a new two-stream backbone network**: Guided by the Transformer framework, the network enhances one modality through the other, achieving more effective feature extraction.
2. **Proposing a simple and effective CFT module**: The module fuses intra-modality and inter-modality information simultaneously, and the paper provides theoretical insight into why it is effective for multispectral object detection.
3. **Experimental verification**: Extensive experiments show that the proposed method achieves state-of-the-art detection performance on three public datasets.

### Specific methods for solving problems

- **Feature extraction**: The paper redesigns the YOLOv5 feature extraction network as a two-stream backbone and embeds CFT modules in it to promote fusion and interaction between modalities.
- **Self-attention mechanism**: Using the Transformer's self-attention mechanism, the CFT module naturally performs intra-modality and inter-modality feature fusion and captures the latent interactions between the RGB and thermal domains.
- **Computation optimization**: To reduce computational cost, the paper uses global average pooling to down-sample the feature map to a lower, fixed resolution, and then up-samples the result back to the original resolution by bilinear interpolation.

### Experimental results

- **Quantitative analysis**: Results on three datasets (FLIR, LLVIP and VEDAI) show that the CFT module significantly improves detection performance. On VEDAI in particular, mAP75 increases by 18.2% and mAP by 9.2%.
- **Qualitative analysis**: Visualizations of the feature maps and detection results show that the CFT module handles densely occluded objects well and reduces false detections and missed detections.

In conclusion, by introducing the Transformer-based CFT module, the paper addresses cross-modality information fusion in multispectral object detection and significantly improves detection performance.
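
The computation optimization step described above (pool to a fixed grid, then restore the resolution by bilinear interpolation) can be sketched in plain Python. This is a minimal illustration and not the authors' implementation: the function names `adaptive_avg_pool` and `bilinear_upsample` are hypothetical, the feature map is a single-channel list of lists, the 2x2 grid size is chosen only for the example, and the half-pixel centre convention used for interpolation is an assumption. In the method as summarized, the attention-based fusion would run on the small pooled grid between the two calls.

```python
from typing import List

Grid = List[List[float]]


def adaptive_avg_pool(fmap: Grid, out_h: int, out_w: int) -> Grid:
    """Average-pool a 2D map onto a fixed (out_h, out_w) grid.

    Bin i covers rows floor(i * H / out_h) up to ceil((i + 1) * H / out_h),
    and likewise for columns (the usual adaptive-pooling rule).
    """
    in_h, in_w = len(fmap), len(fmap[0])
    pooled: Grid = []
    for i in range(out_h):
        r0 = (i * in_h) // out_h
        r1 = -((-(i + 1) * in_h) // out_h)  # ceiling division
        row = []
        for j in range(out_w):
            c0 = (j * in_w) // out_w
            c1 = -((-(j + 1) * in_w) // out_w)
            cells = [fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


def bilinear_upsample(fmap: Grid, out_h: int, out_w: int) -> Grid:
    """Resize a 2D map to (out_h, out_w) by bilinear interpolation.

    Assumption: output pixel centres are mapped back to input coordinates
    with the half-pixel convention and clamped to the valid range.
    """
    in_h, in_w = len(fmap), len(fmap[0])
    out: Grid = []
    for i in range(out_h):
        y = min(max((i + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1.0)
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = min(max((j + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1.0)
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            top = fmap[y0][x0] * (1 - wx) + fmap[y0][x1] * wx
            bottom = fmap[y1][x0] * (1 - wx) + fmap[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


if __name__ == "__main__":
    feature_map = [[float(4 * r + c + 1) for c in range(4)] for r in range(4)]
    small = adaptive_avg_pool(feature_map, 2, 2)   # fusion would operate here
    restored = bilinear_upsample(small, 4, 4)      # back to the 4x4 resolution
    print(small)
    print(restored)
```

The saving comes from sequence length: self-attention cost grows quadratically with the number of tokens, so attending over a small fixed grid instead of the full-resolution map keeps the cost bounded regardless of input size. The trade-off is that fine spatial detail is averaged away before fusion and only approximately recovered by interpolation.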

Cross-Modality Fusion Transformer for Multispectral Object Detection

A Transformer-Based Object Detector with Coarse-Fine Crossing Representations

Cross-Modality Attentive Feature Fusion for Object Detection in Multispectral Remote Sensing Imagery

ICAFusion: Iterative cross-attention guided feature fusion for multispectral object detection

Multimodal Transformer Using Cross-Channel attention for Object Detection in Remote Sensing Images

Transformer fusion and histogram layer multispectral pedestrian detection network

Multi-scale Cross-Modal Transformer Network for RGB-D Object Detection

A multimodal hyper-fusion transformer for remote sensing image classification

Disentangled Cross-Modal Transformer for RGB-D Salient Object Detection and Beyond

Cross Teaching Between Single-Spectral and Multi-Spectral Detection Transformers for Remote Sensing Object Detection

AFMCT: adaptive fusion module based on cross-modal transformer block for 3D object detection

Background-Aware Cross-Attention Multiscale Fusion for Multispectral Object Detection

A Crossmodal Multiscale Fusion Network for Semantic Segmentation of Remote Sensing Data

Multimodal Fusion Transformer for Remote Sensing Image Classification

Cross Modal Transformer: Towards Fast and Robust 3D Object Detection

DAMSDet: Dynamic Adaptive Multispectral Detection Transformer with Competitive Query Selection and Adaptive Feature Fusion

MCT-Net: Multi-hierarchical cross transformer for hyperspectral and multispectral image fusion

Multispectral Fusion for Object Detection with Cyclic Fuse-and-Refine Blocks

Cross-modal multi-scale feature fusion-based RGB-T saliency object detection method

HATF: Multi-Modal Feature Learning for Infrared and Visible Image Fusion via Hybrid Attention Transformer

MCANet: Hierarchical cross-fusion lightweight transformer based on multi-ConvHead attention for object detection