Cross-Modality Fusion Transformer for Multispectral Object Detection

Fang Qingyun, Han Dapeng, Wang Zhaokui
DOI: https://doi.org/10.48550/arXiv.2111.00273
2022-10-04
Abstract: Multispectral image pairs can provide the combined information, making object detection applications more reliable and robust in the open world. To fully exploit the different modalities, we present a simple yet effective cross-modality feature fusion approach, named Cross-Modality Fusion Transformer (CFT) in this paper. Unlike prior CNNs-based works, guided by the transformer scheme, our network learns long-range dependencies and integrates global contextual information in the feature extraction stage. More importantly, by leveraging the self attention of the transformer, the network can naturally carry out simultaneous intra-modality and inter-modality fusion, and robustly capture the latent interactions between RGB and Thermal domains, thereby significantly improving the performance of multispectral object detection. Extensive experiments and ablation studies on multiple datasets demonstrate that our approach is effective and achieves state-of-the-art detection performance. Our code and models are available at https://github.com/DocF/multispectral-object-detection.
Image and Video Processing, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses how to fuse information from different modalities in multispectral object detection so that the detector becomes more perceptive, reliable, and robust. Specifically, it focuses on designing a cross-modality fusion mechanism that makes full use of the complementarity between modalities, enabling more reliable and robust object detection in the open world. Earlier methods rely mainly on convolutional neural networks (CNNs), which are limited in cross-modality fusion and modality interaction, especially in exploiting the inherent complementarity between modalities. The paper therefore proposes a Transformer-based cross-modality fusion method, the Cross-Modality Fusion Transformer (CFT), which aims to overcome these shortcomings and significantly improve the performance of multispectral object detection.

### Main contributions of the paper

1. **A new two-stream backbone network**: Guided by the Transformer framework, this network enhances one modality with the other, which yields more effective feature extraction.
2. **A simple and effective CFT module**: The module fuses intra-modality and inter-modality information simultaneously, and the paper provides theoretical insights into why it is effective for multispectral object detection.
3. **Experimental verification**: Extensive experiments show that the proposed method achieves state-of-the-art detection performance on three public datasets.

### Specific methods for solving problems

- **Feature extraction**: The paper redesigns the feature extraction network of YOLOv5 as a two-stream backbone and embeds the CFT module in it to promote fusion and interaction between modalities.
- **Self-attention mechanism**: Using the Transformer's self-attention mechanism, the CFT module naturally performs intra-modality and inter-modality feature fusion and captures the latent interactions between the RGB and thermal domains.
- **Computation optimization**: To reduce the computational cost, the paper uses global average pooling to down-sample the feature map to a lower, fixed resolution, and then up-samples the result to the original resolution through bilinear interpolation.

### Experimental results

- **Quantitative analysis**: Experiments on the FLIR, LLVIP, and VEDAI datasets show that the CFT module significantly improves detection performance. On VEDAI in particular, mAP75 increases by 18.2% and mAP increases by 9.2%.
- **Qualitative analysis**: Visualizations of the feature maps and detection results show that the CFT module handles densely occluded objects well and reduces false detections and missed detections.

In conclusion, by introducing the Transformer-based CFT module, the paper effectively addresses cross-modality information fusion in multispectral object detection and significantly improves detection performance.
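
As a rough illustration of the self-attention and pooling steps listed under "Specific methods for solving problems", the sketch below pools an RGB and a thermal feature map to a small grid, concatenates their tokens, and runs one self-attention pass over the joint sequence. It is a simplified, assumption-laden sketch and not the authors' implementation: the function names (`avg_pool`, `flatten`, `self_attention`, `fuse`), the grid size, the identity query/key/value projections, and the toy inputs are all hypothetical, and the bilinear up-sampling back to the original resolution is omitted.

```python
import math

def avg_pool(fmap, grid):
    """Average-pool an H x W x C map (nested lists) down to grid x grid x C.
    ASSUMPTION: H and W are divisible by grid."""
    h, w, c = len(fmap), len(fmap[0]), len(fmap[0][0])
    sh, sw = h // grid, w // grid
    out = []
    for i in range(grid):
        row = []
        for j in range(grid):
            cell = [fmap[i * sh + a][j * sw + b]
                    for a in range(sh) for b in range(sw)]
            row.append([sum(p[k] for p in cell) / len(cell) for k in range(c)])
        out.append(row)
    return out

def flatten(fmap):
    """Turn a grid x grid x C map into a list of C-dimensional tokens."""
    return [pixel for row in fmap for pixel in row]

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product attention.
    ASSUMPTION: identity Q/K/V projections; a real module would learn them."""
    d = len(tokens[0])
    weights = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights.append(softmax(scores))
    fused = [[sum(w * v[k] for w, v in zip(row, tokens)) for k in range(d)]
             for row in weights]
    return fused, weights

def fuse(rgb_map, thermal_map, grid=2):
    """Pool both maps, attend over the concatenated tokens, split the result."""
    rgb_tokens = flatten(avg_pool(rgb_map, grid))
    thermal_tokens = flatten(avg_pool(thermal_map, grid))
    n = len(rgb_tokens)
    fused, weights = self_attention(rgb_tokens + thermal_tokens)
    return fused[:n], fused[n:], weights

# Toy 4 x 4 x 2 inputs (made-up values, for illustration only).
rgb = [[[float(i + j), 1.0] for j in range(4)] for i in range(4)]
thermal = [[[1.0, float(i * j)] for j in range(4)] for i in range(4)]

rgb_out, thermal_out, weights = fuse(rgb, thermal, grid=2)
n = len(rgb_out)

# Each row of the attention matrix splits into an intra-modality part
# and an inter-modality part.
intra_rgb = sum(weights[0][:n])   # first RGB token attending to RGB tokens
inter_rgb = sum(weights[0][n:])   # first RGB token attending to thermal tokens
print(f"intra={intra_rgb:.3f} inter={inter_rgb:.3f}")
```

Because the tokens of both modalities share one sequence, every attention row carries weight on same-modality tokens and on other-modality tokens at once, which is the sense in which a single self-attention pass covers both intra-modality and inter-modality fusion. Pooling first keeps the sequence length fixed at `2 * grid * grid` regardless of the input resolution.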