Rethinking Early-Fusion Strategies for Improved Multispectral Object Detection

Xue Zhang, Si-Yuan Cao, Fang Wang, Runmin Zhang, Zhe Wu, Xiaohan Zhang, Xiaokai Bai, Hui-Liang Shen
DOI: https://doi.org/10.1109/TIV.2024.3462488
2024-09-19
Abstract: Most recent multispectral object detectors employ a two-branch structure to extract features from RGB and thermal images. While the two-branch structure achieves better performance than a single-branch structure, it overlooks inference efficiency. This conflict is increasingly aggressive, as recent works solely pursue higher performance rather than both performance and efficiency. In this paper, we address this issue by improving the performance of efficient single-branch structures. We revisit the reasons causing the performance gap between these structures. For the first time, we reveal the information interference problem in the naive early-fusion strategy adopted by previous single-branch structures. Besides, we find that the domain gap between multispectral images, and weak feature representation of the single-branch structure are also key obstacles for performance. Focusing on these three problems, we propose corresponding solutions, including a novel shape-priority early-fusion strategy, a weakly supervised learning method, and a core knowledge distillation technique. Experiments demonstrate that single-branch networks equipped with these three contributions achieve significant performance enhancements while retaining high efficiency. Our code will be available at https://github.com/XueZ-phd/Efficient-RGB-T-Early-Fusion-Detection.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the conflict between detection performance and inference efficiency in multispectral object detection. Specifically:

1. **Conflict between performance and efficiency**: Current multispectral object detectors typically use a two-branch structure to extract features from RGB and thermal images. Although this structure is more accurate than a single-branch structure, it neglects inference efficiency. Recent research has mainly pursued higher performance while ignoring efficiency.
2. **Problems with early fusion strategies**: Early fusion strategies are simple and well suited to edge-device deployment, but they perform worse. In preliminary experiments, the authors found that naive early fusion does not always outperform single-modality input, especially for objects that require color recognition (such as traffic lights).
3. **Key obstacles**:
   - **Information interference issue**: In naive early fusion, information from different modalities may suppress each other, leading to the loss of important information.
   - **Domain gap issue**: There is a domain gap between RGB and thermal images, which can shift the distribution of the learned representations.
   - **Weak feature representation issue**: Single-branch structures have weaker feature representation capabilities because they have fewer parameters and simpler fusion modules.

To overcome these issues, the authors propose the following solutions:

- **Shape-Prioritized Early Fusion Module (ShaPE)**: This module fuses multispectral images based on the saliency of object shapes, reducing information interference.
- **Weakly Supervised Learning Method**: This method is introduced to reduce the domain gap and improve semantic localization capabilities.
- **Core Knowledge Distillation Technique (CoreKD)**: This technique enhances the feature representation capability of single-branch networks by transferring only the most critical knowledge through knowledge distillation.

Experimental results show that these improvements significantly enhance the performance of early fusion strategies while maintaining high efficiency.
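
To make the naive early-fusion baseline discussed above more concrete, the following is a minimal sketch of one common form of it: stacking the thermal image onto the RGB image as a fourth channel so that a single-branch network sees one input. This is an illustrative assumption, not the authors' ShaPE module or their code; the function name `naive_early_fusion` and the toy image sizes are hypothetical.

```python
import numpy as np


def naive_early_fusion(rgb: np.ndarray, thermal: np.ndarray) -> np.ndarray:
    """Stack an RGB image (H, W, 3) and a thermal image (H, W) or (H, W, 1)
    into a single (H, W, 4) array.

    Assumption: the two images are already spatially aligned. This is plain
    channel concatenation, not the shape-priority fusion proposed in the paper.
    """
    if thermal.ndim == 2:
        # Give the single-channel thermal image an explicit channel axis.
        thermal = thermal[..., np.newaxis]
    if rgb.shape[:2] != thermal.shape[:2]:
        raise ValueError("RGB and thermal images must share the same height and width")
    return np.concatenate([rgb, thermal], axis=-1)


# Toy inputs (hypothetical sizes): an all-zero RGB image and an all-one thermal image.
rgb = np.zeros((4, 6, 3), dtype=np.float32)
thermal = np.ones((4, 6), dtype=np.float32)
fused = naive_early_fusion(rgb, thermal)
print(fused.shape)  # (4, 6, 4)
```

Because both modalities enter the first layer of the network together, nothing in this scheme stops one modality from dominating the other, which is one plausible way the information interference described above could arise.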