Cascaded Cross-modal Alignment for Visible-Infrared Person Re-Identification

Zhaohui Li, Qiangchang Wang, Lu Chen, Xinxin Zhang, Yilong Yin
DOI: https://doi.org/10.1016/j.knosys.2024.112585
IF: 8.139
2024-01-01
Knowledge-Based Systems
Abstract: Visible-Infrared Person Re-Identification faces significant challenges due to cross-modal and intra-modal variations. Although existing methods explore semantic alignment from various angles, severe distribution shifts in heterogeneous data limit the effectiveness of single-level alignment approaches. To address this issue, we propose a Cascaded Cross-modal Alignment (CCA) framework that gradually eliminates distribution discrepancies and aligns semantic features from three complementary perspectives in a cascaded manner. First, at the input level, we propose a Channel-Spatial Recombination (CSR) strategy that reorganizes images along the channel and spatial dimensions while preserving crucial details, reducing visual discrepancies between modalities and thereby narrowing the modality gap in the input images. Second, at the frequency level, we introduce a Low Frequency Masking (LFM) module that randomly masks low-frequency information to emphasize global details that CSR might overlook, thus driving comprehensive alignment of identity semantics. Third, at the part level, we design a Prototype-based Semantic Refinement (PSR) module to refine fine-grained features and mitigate the impact of irrelevant areas in LFM. Guided by global discriminative clues from LFM and by flipped views with pose variations, PSR accurately aligns body parts and enhances semantic consistency. Comprehensive experimental results on the SYSU-MM01 and RegDB datasets demonstrate the superiority of the proposed CCA.
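To make the idea of randomly masking low-frequency information more concrete, the following is a minimal NumPy sketch, not the authors' LFM module. The abstract does not specify how the low-frequency band is defined or how the masking is sampled, so the function name `low_frequency_mask`, the centred square band controlled by `radius`, and the per-coefficient `drop_prob` are all illustrative assumptions.

```python
import numpy as np


def low_frequency_mask(x, radius=4, drop_prob=0.5, rng=None):
    """Randomly zero low-frequency FFT coefficients of a 2D array.

    Assumptions (not taken from the paper): the low-frequency band is a
    centred square of half-width `radius` in the shifted spectrum, and each
    coefficient in that band is dropped independently with `drop_prob`.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = x.shape

    # Move the zero-frequency (DC) component to the centre of the spectrum.
    spec = np.fft.fftshift(np.fft.fft2(x))

    # Boolean mask selecting the low-frequency square around the centre.
    cy, cx = h // 2, w // 2
    yy, xx = np.ogrid[:h, :w]
    low = (np.abs(yy - cy) <= radius) & (np.abs(xx - cx) <= radius)

    # Drop a random subset of the low-frequency coefficients.
    drop = low & (rng.random((h, w)) < drop_prob)
    spec[drop] = 0

    # Back to the spatial domain. Random dropping can break conjugate
    # symmetry, so only the real part is kept.
    return np.fft.ifft2(np.fft.ifftshift(spec)).real


# Usage on a toy 8x8 "image": drop the whole low-frequency band.
image = np.arange(64, dtype=float).reshape(8, 8)
masked = low_frequency_mask(image, radius=1, drop_prob=1.0)
```

With `drop_prob=0.0` the input is returned unchanged up to floating-point error, and with `drop_prob=1.0` the whole band, including the DC component, is removed. In a training pipeline such a transform would presumably be applied per channel to images or feature maps, but that placement is also an assumption here.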