M2FNet: Mask-guided Multi-level Fusion for RGB-T Pedestrian Detection
Xiangyang Li, Shiguo Chen, Chunna Tian, Heng Zhou, Zhenxi Zhang
DOI: https://doi.org/10.1109/tmm.2024.3381377
IF: 7.3
2024-01-01
IEEE Transactions on Multimedia
Abstract: RGB-Thermal (RGB-T) pedestrian detection has shown notable advantages across varied lighting and weather conditions by combining information from RGB and thermal images. Because the two modalities rely on distinct imaging principles, RGB-T data contain both modality-specific and modality-consistent information. However, most existing RGB-T pedestrian detection methods integrate these two types of information indiscriminately, which contaminates the information of each modality. To address this issue, we propose a novel mask-guided multi-level fusion network (M2FNet) for RGB-T pedestrian detection. M2FNet independently explores consistent and specific features in the RGB-T modalities at three different levels, using pixel-level positional information in masks to focus exclusively on pedestrian-related features. Specifically, at the feature extraction level, we selectively embed cross-modality differential compensation (CDC) modules and design a bidirectional multiscale fusion (BMF) module to fully utilize the complementary modality-specific information and improve the precision of the predicted pedestrian masks. At the feature fusion level, a mask-guided global consistency mining (MGCM) module is introduced to capture intra-modal and inter-modal consistent information about pedestrians, which generates highly discriminative RGB-T features. Finally, to further reduce inter-modal differences, we propose a mask-guided pixel-level decision fusion (MPDF) strategy that dynamically weights the RGB-T predictions. Extensive experiments and comparisons demonstrate that the proposed M2FNet, with different backbones, outperforms state-of-the-art detectors on both the publicly available KAIST and CVC-14 RGB-T pedestrian detection datasets.
computer science, information systems, telecommunications, software engineering
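
Illustrative sketch (not from the paper): the abstract states only that the MPDF strategy uses masks to dynamically weight the RGB and thermal predictions, without giving the formulation. The pure-Python example below shows one simple way such a weighting could work, under the assumption that each modality supplies a confidence score for a candidate box plus a pedestrian mask of per-pixel probabilities. The function names, the mean-over-box weighting, and the equal-weight fallback are all hypothetical choices made for illustration and should not be read as the authors' method.

```python
def mean_mask_in_box(mask, box):
    """Average mask probability inside a box (x1, y1, x2, y2), end-exclusive.

    `mask` is a list of rows, each a list of probabilities in [0, 1].
    """
    x1, y1, x2, y2 = box
    values = [mask[y][x] for y in range(y1, y2) for x in range(x1, x2)]
    return sum(values) / len(values) if values else 0.0


def fuse_scores(score_rgb, score_thermal, mask_rgb, mask_thermal, box, eps=1e-6):
    """Blend two detection scores using mask responses as modality weights.

    Assumption (hypothetical): a modality whose mask responds more strongly
    inside the box is trusted more for that box.
    """
    m_rgb = mean_mask_in_box(mask_rgb, box)
    m_thermal = mean_mask_in_box(mask_thermal, box)
    total = m_rgb + m_thermal
    # Fall back to equal weights when neither mask responds inside the box.
    w_rgb = 0.5 if total < eps else m_rgb / total
    return w_rgb * score_rgb + (1.0 - w_rgb) * score_thermal


# Toy night-time case: the RGB mask is weak, the thermal mask is strong.
mask_rgb = [[0.1, 0.1], [0.1, 0.1]]
mask_thermal = [[0.9, 0.9], [0.9, 0.9]]
fused = fuse_scores(0.2, 0.8, mask_rgb, mask_thermal, box=(0, 0, 2, 2))
print(fused)  # lands much closer to the thermal score than to the RGB score
```

In this toy case the fused score is pulled toward the thermal prediction because the thermal mask responds more strongly inside the box, which is the qualitative behavior a mask-guided weighting aims for in low light.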