Asymmetric Mutual Learning for Unsupervised Transferable Visible-Infrared Re-Identification

Ancong Wu, Chengzhi Lin, Wei-Shi Zheng
DOI: https://doi.org/10.1109/tcsvt.2024.3404786
IF: 5.859
2024-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Visible-infrared person re-identification (Re-ID) plays a crucial role in matching people across camera views in darkness and under normal lighting. To reduce annotation cost, it is advantageous to learn a Re-ID model from unlabeled visible-infrared image pairs. However, the large modality gap makes it difficult to discover the underlying cross-modality sample relations. Compared with cross-modality sample pairs in the target domain, single-modality visible image samples from other domains are easier to obtain in larger quantities. In this work, we study unsupervised transfer learning to extract modality-shared knowledge from auxiliary unlabeled visible images in a source domain and leverage this knowledge to learn cross-modality matching in the unlabeled target domain. Our framework consists of two stages: RGB-gray asymmetric mutual learning and unsupervised cross-modality self-training. In the first stage, to extract visible-infrared shared information from auxiliary unlabeled visible images, we regard RGB images and grayscale fake infrared images transformed from RGB images as two views, learning view-shared information while simultaneously preserving RGB-specific information. Based on information-theoretic analysis, we learn an RGB-gray feature extractor and further introduce an auxiliary gray feature extractor to quantify RGB-gray shared knowledge. This knowledge is then transferred to the RGB-gray feature extractor without eliminating RGB-specific information. We call this process Cross-Modality Asymmetric Mutual Learning (CMAM). In the second stage, for unsupervised cross-modality self-training in the target domain, we fuse the complementary knowledge in the two models by mutual learning and employ bipartite cross-modality pseudo labeling to alleviate the modality gap. For a more extensive evaluation, we collected a new public multi-modality dataset, SYSU-MM02, constructed from untrimmed videos. Our method achieves state-of-the-art performance on three benchmark datasets.
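The first stage treats an RGB image and a grayscale "fake infrared" image derived from it as two views of the same sample. The sketch below shows one possible way to build such a pair. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the ITU-R BT.601 luma weights are an assumed choice (the abstract does not specify the conversion), and images are represented as plain nested lists of (r, g, b) tuples to keep the example dependency-free.

```python
# Assumed conversion weights (ITU-R BT.601 luma); the abstract does not
# state which RGB-to-gray transform the paper actually uses.
BT601_WEIGHTS = (0.299, 0.587, 0.114)


def rgb_to_fake_infrared(image):
    """Hypothetical helper: map an RGB image to a 3-channel grayscale image.

    `image` is a list of rows, each row a list of (r, g, b) tuples with
    integer values in 0..255. The gray value is replicated across three
    channels so both views can share the same input shape.
    """
    wr, wg, wb = BT601_WEIGHTS
    gray_image = []
    for row in image:
        gray_row = []
        for r, g, b in row:
            luma = int(round(wr * r + wg * g + wb * b))
            gray_row.append((luma, luma, luma))
        gray_image.append(gray_row)
    return gray_image


def make_two_views(image):
    """Hypothetical helper: return the (RGB view, gray view) pair."""
    return image, rgb_to_fake_infrared(image)


# Tiny 2x2 example image: red, green, blue, and white pixels.
example = [[(255, 0, 0), (0, 255, 0)],
           [(0, 0, 255), (255, 255, 255)]]
rgb_view, gray_view = make_two_views(example)
```

In a setup like this, the RGB view would be fed to the RGB-gray feature extractor and the gray view to the auxiliary gray feature extractor; how the two are coupled through mutual learning is described in the paper itself and is not reproduced here.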