Abstract: Multispectral person detection aims at automatically localizing humans in images that consist of multiple spectral bands. Usually, the visual-optical (VIS) and the thermal infrared (IR) spectra are combined to achieve higher robustness for person detection, especially in insufficiently illuminated scenes. This paper focuses on analyzing existing detection approaches for their generalization ability. Generalization is a key feature for machine-learning-based detection algorithms that are supposed to perform well across different datasets. Inspired by recent literature regarding person detection in the VIS spectrum, we perform a cross-validation study to empirically determine the most promising dataset to train a well-generalizing detector. To this end, we pick one reference Deep Convolutional Neural Network (DCNN) architecture and three different multispectral datasets. The Region Proposal Network (RPN), originally introduced for object detection within the popular Faster R-CNN, is chosen as the reference DCNN. The reason is that a stand-alone RPN is able to serve as a competitive detector for two-class problems such as person detection. Furthermore, current state-of-the-art approaches initially apply an RPN followed by individual classifiers. The three considered datasets are the KAIST Multispectral Pedestrian Benchmark, including recently published improved annotations for training and testing, the Tokyo Multi-spectral Semantic Segmentation dataset, and the OSU Color-Thermal dataset, including recently released annotations. The experimental results show that the KAIST Multispectral Pedestrian Benchmark with its improved annotations provides the best basis to train a DCNN with good generalization ability compared to the other two multispectral datasets. On average, this detection model achieves a log-average Miss Rate (MR) of 29.74 % evaluated on the reasonable test subsets of the three datasets.
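The log-average Miss Rate quoted above is conventionally computed (following the Caltech pedestrian benchmark protocol, which the KAIST benchmark also uses) by sampling the miss rate at nine false-positives-per-image (FPPI) values evenly spaced in log space between 10^-2 and 10^0 and taking their geometric mean. The following is a minimal sketch under that assumption; `log_average_miss_rate` is a hypothetical name, and the paper's actual evaluation code may handle edge cases differently.

```python
import math


def log_average_miss_rate(fppi, miss_rate, num_points=9, fppi_min=1e-2, fppi_max=1.0):
    """Log-average miss rate (LAMR) of a miss-rate vs. FPPI curve.

    Assumed Caltech-style convention: sample the miss rate at `num_points`
    FPPI values evenly spaced in log10 space over [fppi_min, fppi_max] and
    return the geometric mean of the sampled miss rates.
    """
    if len(fppi) != len(miss_rate) or not fppi:
        raise ValueError("fppi and miss_rate must be non-empty and of equal length")
    # Order by increasing FPPI; for equal FPPI put the lower miss rate last,
    # since lowering the score threshold can only reduce the miss rate.
    curve = sorted(zip(fppi, miss_rate), key=lambda p: (p[0], -p[1]))
    lo, hi = math.log10(fppi_min), math.log10(fppi_max)
    refs = [10 ** (lo + i * (hi - lo) / (num_points - 1)) for i in range(num_points)]
    sampled = []
    for ref in refs:
        # Miss rate at the largest FPPI not exceeding the reference value;
        # if the curve starts above `ref`, fall back to its first point (assumption).
        below = [mr for f, mr in curve if f <= ref]
        sampled.append(below[-1] if below else curve[0][1])
    # Geometric mean; clamp to avoid log(0) when the miss rate reaches zero.
    return math.exp(sum(math.log(max(m, 1e-10)) for m in sampled) / len(sampled))
```

The result is a fraction in [0, 1]; multiplying it by 100 gives a percentage in the form used for the 29.74 % figure above. The geometric mean keeps a few very low miss rates at high FPPI from dominating the score.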

What problem does this paper attempt to address?

The problem this paper attempts to solve is **the generalization ability of multispectral pedestrian detectors**. Specifically, the authors ask how to train a pedestrian detection model that performs well across different datasets, combining the visible-light (VIS) and thermal infrared (IR) spectra to make detection more robust, especially in poorly illuminated scenes.

### Background and Problem Description of the Paper

1. **Importance of Multispectral Pedestrian Detection**:
   - Pedestrian detection is an important research direction in computer vision and is widely used in security, surveillance, autonomous driving and other applications.
   - Traditional pedestrian detection methods based on visible-light images are prone to false positives and false negatives under insufficient illumination or in complex environments.
   - Thermal infrared images provide complementary information, especially in low-light conditions. Combining the visible-light and thermal infrared spectra can therefore improve detection performance.
2. **The Key Role of Generalization Ability**:
   - Generalization ability refers to how well a machine-learning model performs on unseen data. For a pedestrian detector, this means it should perform well not only on the dataset it was trained on but also on other datasets.
   - Many existing detection models perform excellently on specific datasets, but their performance can decline significantly on data captured in different environments or with different cameras.
3. **Research Objectives**:
   - Analyze existing detection approaches and evaluate their generalization ability.
   - Identify the dataset best suited for training a multispectral pedestrian detector with good generalization ability.
   - Use the Region Proposal Network (RPN) as the reference model, because a stand-alone RPN performs competitively on two-class problems such as pedestrian detection, and current state-of-the-art approaches apply an RPN first, followed by individual classifiers.

### Experimental Design

- **Selected Datasets**:
  - KAIST Multispectral Pedestrian Benchmark, including recently published improved annotations for training and testing.
  - Tokyo Multi-spectral Semantic Segmentation dataset, which provides multispectral image data.
  - OSU Color-Thermal dataset, including recently released annotations.
- **Experimental Methods**:
  - Use the RPN as the base model to conduct a cross-validation study across the three datasets.
  - Test the influence of different datasets on pre-training and fine-tuning.
  - Evaluate the generalization ability of the model on the different datasets, mainly by computing the log-average Miss Rate (MR).

### Main Findings

- **KAIST Multispectral Pedestrian Benchmark**: The model trained on KAIST with its improved annotations generalizes best, achieving a log-average MR of 29.74 % averaged over the reasonable test subsets of the three datasets.
- **Generalization Ability across Datasets**: Comparing the training datasets shows that the KAIST-trained model performs best and most consistently across the test subsets.

In conclusion, this paper addresses the insufficient generalization ability of models in multispectral pedestrian detection. By systematically evaluating how the choice of training dataset affects a common reference detector, it provides a useful reference for future research.
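The cross-dataset study described above reduces to a train-by-test table of log-average MRs whose row averages are compared. The following is a minimal sketch, assuming each cell holds one MR value measured on a reasonable test subset and that "on average" means an unweighted arithmetic mean over the test sets (the exact averaging is not spelled out here); `rank_training_sets` is a hypothetical helper, not the authors' code.

```python
from statistics import mean


def rank_training_sets(mr_table):
    """Rank training datasets by their mean log-average MR across test sets.

    Hypothetical helper: mr_table[train][test] is the MR (e.g. in %) of a
    detector trained on `train` and evaluated on the reasonable subset of
    `test`. Lower values mean fewer missed pedestrians, so lower is better.
    """
    test_sets = None
    averages = {}
    for train, row in mr_table.items():
        # Every training set must be evaluated on the same test sets,
        # otherwise the row averages are not comparable.
        if test_sets is None:
            test_sets = set(row)
        elif set(row) != test_sets:
            raise ValueError(f"{train!r} was evaluated on different test sets")
        if not row:
            raise ValueError(f"{train!r} has no test results")
        averages[train] = mean(row.values())
    # Ascending order: the best-generalizing training set comes first.
    return sorted(averages.items(), key=lambda item: item[1])
```

With keys such as `"KAIST"`, `"Tokyo"` and `"OSU"` (hypothetical labels for the three datasets) in both dimensions, the first entry of the returned list is the training set whose detector generalizes best under this averaging. Including the in-domain test set in each row mirrors averaging over all three reasonable test subsets.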

Generalization ability of region proposal networks for multispectral person detection

Modality-transfer Generative Adversarial Network and Dual-Level Unified Latent Representation for Visible Thermal Person Re-Identification

Multispectral Pedestrian Detection via Simultaneous Detection and Segmentation

Locality guided cross-modal feature aggregation and pixel-level fusion for multispectral pedestrian detection

Learning a Dynamic Cross-Modal Network for Multispectral Pedestrian Detection

Multispectral pedestrian detection based on feature complementation and enhancement

Pedestrian detection with unsupervised multispectral feature learning using deep neural networks

Attention-Guided Region Proposal Network for Pedestrian Detection

Multispectral Deep Neural Networks for Pedestrian Detection

A Fast RetinaNet Fusion Framework for Multi-Spectral Pedestrian Detection

Transformer fusion and histogram layer multispectral pedestrian detection network

The Cross-Modality Disparity Problem in Multispectral Pedestrian Detection

A Multi-Scale Spatial Attention Region Proposal Network for High-Resolution Optical Remote Sensing Imagery

Illumination-aware Faster R-CNN for Robust Multispectral Pedestrian Detection

M3D-RPN: Monocular 3D Region Proposal Network for Object Detection

MFMANet: a multispectral pedestrian detection network using multi-resolution RGB feature reuse with multi-scale FIR attentions

Weakly Aligned Cross-Modal Learning for Multispectral Pedestrian Detection

Deep Adaptive Proposal Network for Object Detection in Optical Remote Sensing Images

Revisiting Faster R-CNN: A Deeper Look at Region Proposal Network

Two-Stage Pedestrian Detection Model Using a New Classification Head for Domain Generalization

Multiscale Cross-modal Homogeneity Enhancement and Confidence-aware Fusion for Multispectral Pedestrian Detection