Abstract: Multispectral object detection, utilizing RGB and TIR (thermal infrared) modalities, is widely recognized as a challenging task. It requires not only the effective extraction of features from both modalities and robust fusion strategies, but also the ability to address issues such as spectral discrepancies, spatial misalignment, and environmental dependencies between RGB and TIR images. These challenges significantly hinder the generalization of multispectral detection systems across diverse scenarios. Although numerous studies have attempted to overcome these limitations, it remains difficult to clearly distinguish the performance gains of multispectral detection systems from the impact of these "optimization techniques". Worse still, despite the rapid emergence of high-performing single-modality detection models, there is still a lack of specialized training techniques that can effectively adapt these models for multispectral detection tasks. The absence of a standardized benchmark with fair and consistent experimental setups also poses a significant barrier to evaluating the effectiveness of new approaches. To this end, we propose the first fair and reproducible benchmark specifically designed to evaluate the training "techniques", which systematically classifies existing multispectral object detection methods, investigates their sensitivity to hyper-parameters, and standardizes the core configurations. A comprehensive evaluation is conducted across multiple representative multispectral object detection datasets, utilizing various backbone networks and detection frameworks. Additionally, we introduce an efficient and easily deployable multispectral object detection framework that can seamlessly optimize high-performing single-modality models into dual-modality models, integrating our advanced training techniques.

What problem does this paper attempt to address?

This paper addresses several open problems in multispectral object detection. It focuses on three:

1. **Effective utilization of bimodal data**
   - Multispectral object detection must process visible-light (RGB) and thermal infrared (TIR) images simultaneously. This complicates feature fusion and can cause information loss or leave the complementary strengths of the two modalities underused.
   - Registration discrepancies between modalities and the lack of modality-specific enhancement strategies further limit model performance.
2. **Effective optimization strategies for converting high-performance unimodal models into bimodal models**
   - Although many powerful unimodal object detection frameworks have emerged in recent years, there are no robust methods for adapting them to multispectral detection tasks.
   - The gap matters because a unimodal model applied directly to a multispectral task may not reach its potential, and in some cases it performs worse than an optimized unimodal model.
3. **Lack of standardized benchmarks**
   - There is currently no standardized, fair, and consistent experimental setup for evaluating new methods and techniques.
   - Hyper-parameter configurations (hidden-layer dimensions, learning rates, weight decays, and so on) are highly inconsistent across studies, which makes fair and reliable conclusions difficult to draw.

To address these challenges, the paper proposes the following solutions:

- **Multimodal feature fusion**: Introduce advanced multimodal feature fusion techniques that integrate visible-light and infrared data and strengthen the model's feature representation in complex environments.
- **Bimodal data augmentation**: Adopt modality-specific data augmentation strategies to improve the model's robustness across environments and complex scenarios.
- **Alignment optimization**: Use precise alignment techniques to reduce the spatial misalignment between visible-light and infrared data, improving low-light object detection and multimodal information fusion.
- **Optimizing unimodal models for bimodal tasks**: Provide new benchmarks and training techniques that turn high-performance unimodal models into bimodal detection models, so that the adapted models can, in some cases, outperform complex bimodal architectures.

Through these improvements, the paper aims to establish a fair and reproducible benchmark that systematically classifies existing multispectral object detection methods, investigates their sensitivity to hyper-parameters, and standardizes core configurations. The paper also reports extensive experiments that demonstrate the effectiveness of these techniques on multiple representative multispectral object detection datasets.
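The idea of standardizing core configurations can be illustrated with a small sketch. It is not taken from the paper: the class `CoreConfig`, the helpers `make_run` and `is_fair`, the method names, and every default value are hypothetical, chosen only to show how shared hyper-parameters (hidden-layer dimension, learning rate, weight decay) might be pinned once and checked across the methods being compared.

```python
from dataclasses import dataclass, asdict
from typing import Dict, List


@dataclass(frozen=True)
class CoreConfig:
    """Hyper-parameters held fixed for every compared method (illustrative values)."""
    hidden_dim: int = 256
    learning_rate: float = 1e-3
    weight_decay: float = 5e-4
    epochs: int = 100
    seed: int = 0


def make_run(method: str, core: CoreConfig, **method_specific) -> Dict:
    """Describe one run: the shared core settings plus method-specific extras."""
    overlap = set(method_specific) & set(asdict(core))
    if overlap:
        # A method must not silently override a standardized setting.
        raise ValueError(f"{method} overrides core settings: {sorted(overlap)}")
    return {"method": method, "core": core, "extra": dict(method_specific)}


def is_fair(runs: List[Dict]) -> bool:
    """Return True when all runs share exactly the same core configuration."""
    # Frozen dataclasses are hashable, so identical configs collapse in a set.
    return len({run["core"] for run in runs}) <= 1


core = CoreConfig()
runs = [
    make_run("early_fusion", core),                # hypothetical method name
    make_run("mid_fusion", core, fusion_stage=3),  # hypothetical extra option
]
print(is_fair(runs))
```

Freezing the dataclass makes the shared settings immutable and hashable, so a comparison that mixes runs with different learning rates or weight decays is caught by `is_fair` before any results are tabulated.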

Optimizing Multispectral Object Detection: A Bag of Tricks and Comprehensive Benchmarks
