Abstract: Most recent multispectral object detectors employ a two-branch structure to extract features from RGB and thermal images. While the two-branch structure achieves better performance than a single-branch structure, it overlooks inference efficiency. This conflict is becoming increasingly severe, as recent works pursue higher performance alone rather than both performance and efficiency. In this paper, we address this issue by improving the performance of efficient single-branch structures. We revisit the causes of the performance gap between these structures. For the first time, we reveal the information interference problem in the naive early-fusion strategy adopted by previous single-branch structures. In addition, we find that the domain gap between multispectral images and the weak feature representation of the single-branch structure are also key obstacles to performance. Focusing on these three problems, we propose corresponding solutions: a novel shape-priority early-fusion strategy, a weakly supervised learning method, and a core knowledge distillation technique. Experiments demonstrate that single-branch networks equipped with these three contributions achieve significant performance enhancements while retaining high efficiency. Our code will be available at https://github.com/XueZ-phd/Efficient-RGB-T-Early-Fusion-Detection.

What problem does this paper attempt to address?

The paper attempts to address the conflict between performance and inference efficiency in multispectral object detection. Specifically:

1. **Conflict between performance and efficiency**: Current multispectral object detectors typically use a dual-branch structure to extract features from RGB images and thermal images. Although this structure outperforms single-branch structures in terms of performance, it neglects inference efficiency. Recent research has mainly pursued higher performance while ignoring efficiency issues.
2. **Problems with early fusion strategies**: Early fusion strategies, although simple and suitable for edge device deployment, have lower performance. The authors found through preliminary research that simple early fusion strategies do not always achieve better performance than single-modal inputs, especially for objects requiring color recognition (such as traffic lights).
3. **Key obstacles**:
   - **Information interference issue**: In simple early fusion strategies, information from different modalities may suppress each other, leading to the loss of important information.
   - **Domain gap issue**: There is a domain gap between RGB images and thermal images, which can cause a shift in representation distribution.
   - **Weak feature representation issue**: Single-branch structures have weaker feature representation capabilities due to fewer parameters and simpler fusion modules.

To overcome these issues, the authors propose the following solutions:

- **Shape-Prioritized Early Fusion Module (ShaPE)**: This module fuses multispectral images based on the saliency of object shapes, reducing information interference.
- **Weakly Supervised Learning Method**: This method is introduced to reduce the domain gap and improve semantic localization capabilities.
- **Core Knowledge Distillation Technique (CoreKD)**: This technique enhances the feature representation capability of single-branch networks by transferring only the most critical knowledge through knowledge distillation.

Experimental results show that these improvements significantly enhance the performance of early fusion strategies while maintaining high efficiency.
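For readers unfamiliar with the single-branch setting, the sketch below illustrates one common form of naive early fusion: stacking the RGB and thermal channels into a single input and mixing them with one per-pixel weighted sum, as the first layer of a single-branch network would. This is only an illustration under stated assumptions, not the paper's baseline or its ShaPE module. The function names `early_fuse` and `pointwise_mix`, the equal weights, and the toy pixel values are all hypothetical.

```python
from typing import List

# Assumed layout for this sketch: image[channel][row][col]
Image = List[List[List[float]]]


def early_fuse(rgb: Image, thermal: Image) -> Image:
    """Naive early fusion (assumed form): stack RGB and thermal channels."""
    channels = rgb + thermal
    h, w = len(channels[0]), len(channels[0][0])
    for ch in channels:
        if len(ch) != h or any(len(row) != w for row in ch):
            raise ValueError("RGB and thermal images must share the same H x W")
    # Copy rows so the fused input does not alias the source images.
    return [[row[:] for row in ch] for ch in channels]


def pointwise_mix(x: Image, weights: List[float]) -> List[List[float]]:
    """Per-pixel weighted sum over channels (a 1x1 convolution with one output)."""
    if len(weights) != len(x):
        raise ValueError("need one weight per input channel")
    h, w = len(x[0]), len(x[0][0])
    return [
        [sum(wt * ch[i][j] for wt, ch in zip(weights, x)) for j in range(w)]
        for i in range(h)
    ]


# Toy 1 x 2 images: three RGB channels and one thermal channel (made-up values).
rgb = [[[0.2, 0.4]], [[0.1, 0.3]], [[0.0, 0.5]]]
thermal = [[[0.9, 0.1]]]

fused = early_fuse(rgb, thermal)                           # 4 channels
feature = pointwise_mix(fused, [0.25, 0.25, 0.25, 0.25])  # one mixed feature map
```

Because every output value is a single sum over all four channels, a strong response in one modality can dominate or cancel a weak but useful response in the other. That is one way to picture the information interference the summary describes, although the paper's own analysis may differ.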

Rethinking Early-Fusion Strategies for Improved Multispectral Object Detection

ICAFusion: Iterative cross-attention guided feature fusion for multispectral object detection

Exploiting fusion architectures for multispectral pedestrian detection and segmentation

Cross-Modality Attentive Feature Fusion for Object Detection in Multispectral Remote Sensing Imagery

Discriminative feature fusion for RGB-D salient object detection

Cross-Modality Fusion Transformer for Multispectral Object Detection

Multispectral Fusion for Object Detection with Cyclic Fuse-and-Refine Blocks

Removal and Selection: Improving RGB-Infrared Object Detection via Coarse-to-Fine Fusion

RGB-X Object Detection via Scene-Specific Fusion Modules

Improving RGB-Infrared Object Detection by Reducing Cross-Modality Redundancy

Revisiting Feature Fusion for RGB-T Salient Object Detection

Multispectral Object Detection Based on Multilevel Feature Fusion and Dual Feature Modulation

Multispectral Deep Neural Network Fusion Method for Low-Light Object Detection

Employing Bilinear Fusion and Saliency Prior Information for RGB-D Salient Object Detection

An Adaptive Attention Fusion Mechanism Convolutional Network for Object Detection in Remote Sensing Images

A Fast RetinaNet Fusion Framework for Multi-Spectral Pedestrian Detection

Multi-Modal Object Detection Method Based on Dual-Branch Asymmetric Attention Backbone and Feature Fusion Pyramid Network

Learning Adaptive Fusion Bank for Multi-modal Salient Object Detection

Middle-Level Feature Fusion for Lightweight RGB-D Salient Object Detection

DCFusion: A Dual-Frequency Cross-Enhanced Fusion Network for Infrared and Visible Image Fusion

SparseFusion: Efficient Sparse Multi-Modal Fusion Framework for Long-Range 3D Perception