Abstract: Pedestrian detection is a critical task for safety-critical systems, but detecting pedestrians is challenging in low-light and adverse weather conditions. Thermal images can improve robustness by providing information complementary to RGB images. Previous studies have shown that multi-modal feature fusion using convolution operations can be effective, but such methods rely solely on local feature correlations, which can degrade performance. To address this issue, we propose a novel attention-based fusion network, referred to as INSANet (INtra-INter Spectral Attention Network), that captures global intra- and inter-spectral information. It consists of intra- and inter-spectral attention blocks that allow the model to learn mutual spectral relationships. Additionally, we identified an imbalance in the multispectral dataset caused by several factors and designed an augmentation strategy that mitigates concentrated distributions and enables the model to learn the diverse locations of pedestrians. Extensive experiments demonstrate the effectiveness of the proposed methods, which achieve state-of-the-art performance on the KAIST and LLVIP datasets. Finally, we conduct a regional performance evaluation to demonstrate the effectiveness of our proposed network in various regions.

What problem does this paper attempt to address?

The paper aims to improve multispectral pedestrian detection under low-light and adverse weather conditions. Specifically, it focuses on three issues:

1. **Multispectral Feature Fusion**: Existing multimodal feature fusion methods rely mainly on convolution operations, which capture only local feature correlations and struggle to capture long-range spatial dependencies. This limits the model's performance in complex environments.
2. **Capturing Intra- and Inter-Modal Information**: Existing methods often ignore the relationships among features within each modality during fusion, leading to information loss. The paper proposes a new attention mechanism that enhances feature representation by introducing intra- and inter-modal attention blocks.
3. **Dataset Distribution Imbalance**: In multispectral datasets, the distribution of pedestrian positions is imbalanced, especially in road scenes, where pedestrians are concentrated in a few specific areas. This imbalance can hurt the model's ability to generalize. The paper designs a data augmentation strategy that alleviates the problem through translation transformations.

To address these issues, the paper proposes the following methods:

- **INtra-INter Spectral Attention Network (INSANet)**: A novel attention-based fusion network that captures global information within and between modalities. INSANet consists of intra-modal and inter-modal attention blocks. It enhances feature representation through self-attention and cross-attention and promotes the exchange of complementary information between modalities.
- **Data Augmentation Strategy**: To alleviate the imbalanced distribution of pedestrian positions in the dataset, the paper proposes a simple translation augmentation method. By randomly translating images, the method disperses concentrated pedestrian areas, thereby improving the model's detection performance in various regions.

With these methods, the paper achieves state-of-the-art performance on the KAIST and LLVIP datasets, validating the effectiveness of the proposed approach.
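The intra-modal (self-attention) and inter-modal (cross-attention) idea described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: it assumes plain scaled dot-product attention without learned projections, residual connections after each step, and fusion by channel concatenation. The names `attention` and `intra_inter_fusion` are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, key, value):
    """Scaled dot-product attention over token sequences of shape (N, C)."""
    scale = np.sqrt(query.shape[-1])
    weights = softmax(query @ key.T / scale, axis=-1)  # (N_query, N_key)
    return weights @ value, weights

def intra_inter_fusion(rgb, thermal):
    """Hypothetical fusion: self-attention per modality, then cross-attention."""
    # Intra-modal step: every token attends to all tokens of its own modality.
    rgb_intra = rgb + attention(rgb, rgb, rgb)[0]
    thr_intra = thermal + attention(thermal, thermal, thermal)[0]
    # Inter-modal step: queries come from one modality, keys and values from the other.
    rgb_inter = rgb_intra + attention(rgb_intra, thr_intra, thr_intra)[0]
    thr_inter = thr_intra + attention(thr_intra, rgb_intra, rgb_intra)[0]
    # Assumption: fuse by concatenating the two streams along the channel axis.
    return np.concatenate([rgb_inter, thr_inter], axis=-1)

# Toy feature maps flattened to tokens: a 4x5 spatial grid with 8 channels.
rng = np.random.default_rng(0)
height, width, channels = 4, 5, 8
rgb_tokens = rng.standard_normal((height * width, channels))
thermal_tokens = rng.standard_normal((height * width, channels))

fused = intra_inter_fusion(rgb_tokens, thermal_tokens)
print(fused.shape)  # (20, 16)
```

Because each attention weight matrix spans all spatial positions, every token can draw on information from anywhere in the feature map, which is the global context that a convolution with a small kernel lacks.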
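The translation augmentation can likewise be sketched under stated assumptions. The summary only says that images are randomly translated. The zero fill for vacated pixels, the clipping and dropping of boxes, the shift range, and the function names `translate_pair` and `random_translate` are all assumptions made here for illustration. The sketch applies the same offset to the RGB and thermal images so that the pair stays spatially aligned.

```python
import numpy as np

def translate_pair(rgb, thermal, boxes, dx, dy):
    """Shift both images by (dx, dy) pixels and move the boxes with them.

    boxes has shape (K, 4) in [x1, y1, x2, y2] pixel coordinates.
    Vacated pixels are zero-filled (an assumption, not taken from the paper).
    """
    h, w = rgb.shape[:2]
    if abs(dx) >= w or abs(dy) >= h:
        raise ValueError("shift must be smaller than the image size")

    def shift(img):
        out = np.zeros_like(img)
        # Source window in the input and destination window in the output.
        src_x, dst_x = slice(max(0, -dx), min(w, w - dx)), slice(max(0, dx), min(w, w + dx))
        src_y, dst_y = slice(max(0, -dy), min(h, h - dy)), slice(max(0, dy), min(h, h + dy))
        out[dst_y, dst_x] = img[src_y, src_x]
        return out

    moved = boxes.astype(float) + np.array([dx, dy, dx, dy])
    moved[:, [0, 2]] = moved[:, [0, 2]].clip(0, w)
    moved[:, [1, 3]] = moved[:, [1, 3]].clip(0, h)
    # Drop boxes that were pushed entirely out of the frame.
    keep = (moved[:, 2] > moved[:, 0]) & (moved[:, 3] > moved[:, 1])
    return shift(rgb), shift(thermal), moved[keep]

def random_translate(rgb, thermal, boxes, max_shift, rng):
    """Draw one random offset and apply it to both modalities."""
    dx, dy = rng.integers(-max_shift, max_shift + 1, size=2)
    return translate_pair(rgb, thermal, boxes, int(dx), int(dy))

# Toy aligned pair: a 6x8 RGB image, a 6x8 thermal image, one pedestrian box.
rgb = np.arange(6 * 8 * 3).reshape(6, 8, 3)
thermal = np.arange(6 * 8).reshape(6, 8)
boxes = np.array([[1, 1, 4, 5]])

rgb_t, thr_t, boxes_t = translate_pair(rgb, thermal, boxes, dx=2, dy=-1)
print(boxes_t)  # [[3. 0. 6. 4.]]
```

In training, `random_translate` would be called once per sample with a generator such as `np.random.default_rng()`, so that pedestrians end up at a wider range of image positions than in the raw data.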

INSANet: INtra-INter Spectral Attention Network for Effective Feature Fusion of Multispectral Pedestrian Detection
