Abstract: Visible-infrared person re-identification (VI-ReID) aims to match person images captured by visible and infrared cameras, and it suffers from severe cross-modality discrepancy and intra-modality variations. Existing approaches mainly use convolutional neural network (CNN)-based architectures to extract pedestrian features, which fail to capture the long-range dependencies within an image. In addition, previous works usually attempt to bridge the modality gap either by using adversarial learning to generate style-consistent images or by designing feature-level metric learning constraints. However, few works consider the cross-modality disparity from the perspective of the overall discrepancy between distance distributions. To address these problems, we design a pure Transformer-based Visible-Infrared (TransVI) network with a conventional two-stream structure, which explicitly captures modality-specific representations and learns knowledge shared across modalities. The multi-head self-attention modules in the transformer let TransVI capture the long-range dependencies of pedestrian images, addressing the lack of global dependency in CNN-based architectures. Furthermore, we introduce the Cross-Modality Dissimilarity-based Maximum Mean Discrepancy (CMD-MMD) constraint to handle the cross-modality discrepancy at the distance distribution level. Specifically, CMD-MMD leverages intra-modality distribution separability to guide inter-modality distribution separability learning, aligning the pair-wise distance distributions of intra- and inter-modality pairs separately for within-class and between-class pairs. In this way, the intra- and inter-modality distance distributions become more similar, which significantly mitigates the cross-modality discrepancy and yields more modality-invariant representations. Extensive experimental results on two public VI-ReID datasets confirm that our proposed framework achieves state-of-the-art performance.
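The abstract describes CMD-MMD only at a high level, so the following is a minimal NumPy sketch of the general idea rather than the authors' formulation. It compares the distribution of intra-modality pair-wise distances with that of inter-modality distances, once for within-class pairs and once for between-class pairs, using a squared MMD estimate. The function names (`pairwise_dist`, `gaussian_mmd2`, `cmd_mmd_sketch`), the Gaussian kernel, the bandwidth `sigma`, the Euclidean distance, and the use of visible-visible pairs as the only intra-modality reference are all assumptions made for illustration.

```python
import numpy as np

def pairwise_dist(a, b):
    # Euclidean distance between every row of a and every row of b.
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def gaussian_mmd2(x, y, sigma=1.0):
    # Biased estimate of the squared MMD between two 1-D samples,
    # using a Gaussian (RBF) kernel with bandwidth sigma (assumed choice).
    def k(u, v):
        return np.exp(-((u[:, None] - v[None, :]) ** 2) / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def cmd_mmd_sketch(f_vis, f_ir, y_vis, y_ir, sigma=1.0):
    # Hypothetical illustration: align inter-modality distance distributions
    # with intra-modality ones, separately for same-class and different-class pairs.
    d_intra = pairwise_dist(f_vis, f_vis)   # visible-visible distances
    d_inter = pairwise_dist(f_vis, f_ir)    # visible-infrared distances
    same_intra = y_vis[:, None] == y_vis[None, :]
    same_inter = y_vis[:, None] == y_ir[None, :]
    off_diag = ~np.eye(len(f_vis), dtype=bool)  # drop trivial self-distances
    within = gaussian_mmd2(d_intra[same_intra & off_diag],
                           d_inter[same_inter], sigma)
    between = gaussian_mmd2(d_intra[~same_intra],
                            d_inter[~same_inter], sigma)
    return within + between

# Toy usage: 4 visible and 4 infrared features from 2 identities,
# with the infrared features shifted to mimic a modality gap.
rng = np.random.default_rng(0)
f_vis = rng.normal(size=(4, 8))
f_ir = rng.normal(size=(4, 8)) + 2.0
labels = np.array([0, 0, 1, 1])
loss = cmd_mmd_sketch(f_vis, f_ir, labels, labels)
```

In a training framework with automatic differentiation, the intra-modality distances would plausibly be treated as a fixed target so that they guide the inter-modality ones, as the abstract suggests; that detail is omitted here because NumPy has no gradients. Each identity needs at least two visible samples in the batch so that the within-class intra-modality set is not empty.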

Spatial-Channel Enhanced Transformer for Visible-Infrared Person Re-Identification

Dual-stream Transformer with Distribution Alignment for Visible-Infrared Person Re-Identification

Person Re-identification Based on Transform Algorithm

Visible-Infrared Person Re-Identification via Cross-Modality Interaction Transformer

Cross-Modality Spatial-Temporal Transformer for Video-Based Visible-Infrared Person Re-Identification

Cross-Modality Transformer With Modality Mining for Visible-Infrared Person Re-Identification

Cross-Modality Transformer for Visible-Infrared Person Re-Identification

Enhanced visible–infrared person re-identification based on cross-attention multiscale residual vision transformer

Learning Progressive Modality-shared Transformers for Effective Visible-Infrared Person Re-identification

Transformer-Based Feature Compensation Network for Aerial Photography Person and Ground Object Recognition

Deeply-Coupled Convolution-Transformer with Spatial-temporal Complementary Learning for Video-based Person Re-identification

An Efficient Framework for Visible-Infrared Cross Modality Person Re-Identification

Feature separation and double causal comparison loss for visible and infrared person re-identification

TransReID: Transformer-based Object Re-Identification

Occluded Visible-Infrared Person Re-Identification

A Video Is Worth Three Views: Trigeminal Transformers for Video-Based Person Re-Identification

Dynamic Identity-Guided Attention Network for Visible-Infrared Person Re-identification

Progressive Discriminative Feature Learning for Visible-Infrared Person Re-Identification

Visible-Infrared Person Re-Identification Based on Frequency-Domain Simulated Multispectral Modality for Dual-Mode Cameras

Frequency Domain Modality-invariant Feature Learning for Visible-infrared Person Re-Identification

CycleTrans: Learning Neutral yet Discriminative Features for Visible-Infrared Person Re-Identification