INSANet: INtra-INter Spectral Attention Network for Effective Feature Fusion of Multispectral Pedestrian Detection

Sangin Lee, Taejoo Kim, Jeongmin Shin, Namil Kim, Yukyung Choi
DOI: https://doi.org/10.3390/s24041168
IF: 3.9
2024-02-11
Sensors
Abstract: Pedestrian detection is a critical task for safety-critical systems, but detecting pedestrians is challenging in low-light and adverse weather conditions. Thermal images can be used to improve robustness by providing complementary information to RGB images. Previous studies have shown that multi-modal feature fusion using convolution operation can be effective, but such methods rely solely on local feature correlations, which can degrade the performance capabilities. To address this issue, we propose an attention-based novel fusion network, referred to as INSANet (INtra-INter Spectral Attention Network), that captures global intra- and inter-information. It consists of intra- and inter-spectral attention blocks that allow the model to learn mutual spectral relationships. Additionally, we identified an imbalance in the multispectral dataset caused by several factors and designed an augmentation strategy that mitigates concentrated distributions and enables the model to learn the diverse locations of pedestrians. Extensive experiments demonstrate the effectiveness of the proposed methods, which achieve state-of-the-art performance on the KAIST dataset and LLVIP dataset. Finally, we conduct a regional performance evaluation to demonstrate the effectiveness of our proposed network in various regions.
engineering, electrical & electronic; chemistry, analytical; instruments & instrumentation
What problem does this paper attempt to address?
The paper aims to improve multispectral pedestrian detection under low-light and adverse weather conditions. It focuses on three key issues:

1. **Multispectral Feature Fusion**: Existing multimodal feature fusion methods rely mainly on convolution operations, which capture only local feature correlations and struggle to capture long-range spatial dependencies. This limits the model's performance in complex environments.
2. **Capturing Intra- and Inter-Modal Information**: Existing methods often ignore the relationships among features within each modality during fusion, which leads to information loss. The paper proposes an attention mechanism that enhances feature representation through intra- and inter-modal attention blocks.
3. **Dataset Distribution Imbalance**: In multispectral datasets, the distribution of pedestrian positions is imbalanced, especially in road scenes, where pedestrians are concentrated in a few specific areas. This imbalance can hurt the model's generalization ability. The paper designs a data augmentation strategy based on translation transformations to alleviate the problem.

To address these issues, the paper proposes the following methods:

- **INtra-INter Spectral Attention Network (INSANet)**: A novel attention-based fusion network that captures global information within and between modalities. INSANet consists of intra-modal attention blocks and inter-modal attention blocks, which enhance feature representation through self-attention and cross-attention and promote the exchange of complementary information between modalities.
- **Data Augmentation Strategy**: To alleviate the imbalanced distribution of pedestrian positions in the dataset, the paper proposes a simple translation augmentation method. By randomly translating images, the method disperses concentrated pedestrian areas, thereby improving the model's detection performance in various regions.

With these methods, the paper achieves state-of-the-art performance on the KAIST and LLVIP datasets, validating the effectiveness of the proposed approach.
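The intra- and inter-modal attention blocks described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the single-head attention, the residual connections, the order of the blocks, the final channel concatenation, and every name (`attention`, `fuse`, `params`) are assumptions made for illustration. The sketch only shows how self-attention relates all positions within one modality, while cross-attention lets one modality query the other.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(query_src, key_value_src, w_q, w_k, w_v):
    """Single-head scaled dot-product attention over (N, C) token arrays.

    Self-attention when both sources are the same modality,
    cross-attention when they differ.
    """
    q = query_src @ w_q
    k = key_value_src @ w_k
    v = key_value_src @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (N, N), rows sum to 1
    return weights @ v, weights


def fuse(rgb_map, thermal_map, params):
    """Fuse two (C, H, W) feature maps into one (2C, H, W) map.

    `params` is a hypothetical dict mapping a block name to a tuple of
    three (C, C) projection matrices (w_q, w_k, w_v).
    """
    c, h, w = rgb_map.shape
    rgb = rgb_map.reshape(c, h * w).T          # (N, C): one token per position
    thr = thermal_map.reshape(c, h * w).T

    # Intra-modal step (assumed form): each modality attends to itself.
    rgb = rgb + attention(rgb, rgb, *params["intra_rgb"])[0]
    thr = thr + attention(thr, thr, *params["intra_thermal"])[0]

    # Inter-modal step (assumed form): each modality queries the other.
    rgb_out = rgb + attention(rgb, thr, *params["inter_rgb"])[0]
    thr_out = thr + attention(thr, rgb, *params["inter_thermal"])[0]

    fused = np.concatenate([rgb_out, thr_out], axis=1)  # (N, 2C)
    return fused.T.reshape(2 * c, h, w)


def make_params(channels, rng):
    # Random projections stand in for learned weights.
    names = ["intra_rgb", "intra_thermal", "inter_rgb", "inter_thermal"]
    return {
        n: tuple(rng.standard_normal((channels, channels)) * 0.1 for _ in range(3))
        for n in names
    }
```

Because every position attends to every other position, the attention matrix has shape (H·W, H·W), which is what gives the global receptive field that convolution lacks; it is also why such blocks are usually applied to downsampled feature maps rather than full-resolution images.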
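The translation augmentation described above can be sketched as follows. This is a hedged illustration rather than the paper's exact procedure: the zero padding, the `[x1, y1, x2, y2]` pixel box format, the rule for dropping boxes that leave the frame, and the names `shift_image`, `translate_pair`, `random_translate`, and `max_shift` are all assumptions. The one point the sketch relies on is that the RGB image, the thermal image, and the boxes must all receive the same shift so the pair stays aligned.

```python
import numpy as np


def shift_image(img, dx, dy):
    """Translate an image by (dx, dy) pixels, filling exposed areas with zeros."""
    h, w = img.shape[:2]
    out = np.zeros_like(img)
    # Source window that remains visible after the shift.
    x0, x1 = max(0, -dx), min(w, w - dx)
    y0, y1 = max(0, -dy), min(h, h - dy)
    if x0 < x1 and y0 < y1:
        out[y0 + dy:y1 + dy, x0 + dx:x1 + dx] = img[y0:y1, x0:x1]
    return out


def translate_pair(rgb, thermal, boxes, dx, dy):
    """Apply one shared shift to an RGB/thermal pair and its boxes.

    boxes: (K, 4) array of [x1, y1, x2, y2] in pixels (assumed format).
    Boxes are clipped to the frame; boxes with no remaining area are dropped.
    """
    h, w = rgb.shape[:2]
    shifted = boxes.astype(np.float64) + np.array([dx, dy, dx, dy])
    shifted[:, [0, 2]] = np.clip(shifted[:, [0, 2]], 0, w)
    shifted[:, [1, 3]] = np.clip(shifted[:, [1, 3]], 0, h)
    keep = (shifted[:, 2] > shifted[:, 0]) & (shifted[:, 3] > shifted[:, 1])
    return shift_image(rgb, dx, dy), shift_image(thermal, dx, dy), shifted[keep]


def random_translate(rgb, thermal, boxes, max_shift, rng):
    """Draw a random integer shift in [-max_shift, max_shift] per axis."""
    dx = int(rng.integers(-max_shift, max_shift + 1))
    dy = int(rng.integers(-max_shift, max_shift + 1))
    return translate_pair(rgb, thermal, boxes, dx, dy)
```

A call such as `random_translate(rgb, thermal, boxes, max_shift=32, rng=np.random.default_rng())` would be applied per training sample; the value 32 is arbitrary here, and the shift range actually used in the paper is not stated in this summary.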