Cross-video Identity Correlating for Person Re-identification Pre-training

Jialong Zuo, Ying Nie, Hanyu Zhou, Huaxin Zhang, Haoyu Wang, Tianyu Guo, Nong Sang, Changxin Gao
2024-09-27
Abstract: Recent research has shown that pre-training on large-scale person images extracted from internet videos is an effective way to learn better representations for person re-identification. However, this research is mostly confined to pre-training at the instance level or the single-video tracklet level. It ignores the identity-invariance in images of the same person across different videos, which is a key focus in person re-identification. To address this issue, we propose a Cross-video Identity-cOrrelating pre-traiNing (CION) framework. Defining a noise concept that comprehensively considers both intra-identity consistency and inter-identity discrimination, CION seeks the identity correlation from cross-video images by modeling it as a progressive multi-level denoising problem. Furthermore, an identity-guided self-distillation loss is proposed to implement better large-scale pre-training by mining the identity-invariance within person images. We conduct extensive experiments to verify the superiority of CION in terms of efficiency and performance. CION achieves significantly leading performance with even fewer training samples. For example, compared with the previous state of the art (ISR), CION with the same ResNet50-IBN achieves higher mAP of 93.3% and 74.3% on Market1501 and MSMT17, while utilizing only 8% of the training samples. Finally, with CION demonstrating superior model-agnostic ability, we contribute a model zoo named ReIDZoo to meet diverse research and application needs in this field. It contains a series of CION pre-trained models spanning a range of structures and parameter counts, totaling 32 models with 10 different structures, including GhostNet, ConvNext, RepViT, FastViT and so on. The code and models will be made publicly available at https://github.com/Zplusdragon/CION_ReIDZoo.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses the neglect of cross-video identity invariance in pre-training for person re-identification (ReID). Existing pre-training methods are mostly limited to instance-level or single-video tracklet-level learning, so they ignore the fact that images of the same person should yield consistent representations across different videos. This identity invariance is crucial for person re-identification. The authors therefore propose a Cross-video Identity-cOrrelating pre-traiNing (CION) framework, which models the search for identity correlations in cross-video images as a progressive multi-level denoising problem and introduces an identity-guided self-distillation loss for better large-scale pre-training.

### Main Contributions

1. **Proposing the CION Framework**: This framework explicitly explores the identity invariance in person images extracted from internet videos through progressive multi-level denoising and identity-guided self-distillation, thereby improving the effectiveness of representation learning.
2. **Experimental Validation**: Extensive experiments validate the superiority of CION in terms of efficiency and performance. CION achieves better performance than existing methods with fewer training samples. For example, CION with ResNet50-IBN achieves 93.3% mAP on Market1501 and 74.3% mAP on MSMT17 while using only 8% of the training samples used by the previous state of the art (ISR).
3. **Contributing the ReIDZoo Model Library**: To meet diverse research and application needs, the authors constructed a fully open-source model library, ReIDZoo, which includes 32 CION pre-trained models with different structures and parameters, covering 10 different model architectures such as GhostNet, ConvNext, RepViT, and FastViT.

### Method Overview

1. **Noise Definition**: The authors define a noise concept that comprehensively considers intra-identity consistency and inter-identity discrimination.
2. **Progressive Multi-level Denoising**: Through single-tracklet denoising, short-range single-video denoising, and long-range cross-video denoising, the noise in the sample set is gradually reduced to seek better identity correlations.
3. **Identity-guided Self-distillation**: Using the identity correlation information, the student network is trained to match the output probability distribution of the teacher network through contrastive learning and self-distillation, thereby better learning identity invariance.

### Experimental Results

- **Supervised Person Re-identification**: On the Market1501 and MSMT17 datasets, CION significantly outperforms existing state-of-the-art methods while using fewer training samples.
- **Unsupervised Person Re-identification**: In both unsupervised domain adaptation (UDA) and unsupervised learning (USL) settings, CION achieves new state-of-the-art performance, significantly surpassing all previous methods.

In summary, by proposing the CION framework, this paper brings cross-video identity invariance into person re-identification pre-training and significantly improves both the performance and the training efficiency of the resulting models.
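
The identity-guided self-distillation step in the Method Overview can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a DINO-style cross-entropy between a sharpened teacher distribution and a student distribution, applied to pairs of different images that share an identity label. The function names, the temperatures (`student_temp`, `teacher_temp`), and the pairing scheme are all hypothetical choices made for illustration.

```python
import math
from itertools import permutations


def softmax(logits, temperature):
    """Temperature-scaled softmax over a list of logits (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(p_teacher, p_student, eps=1e-12):
    """H(p_teacher, p_student); student probabilities are clamped at eps."""
    return -sum(pt * math.log(max(ps, eps))
                for pt, ps in zip(p_teacher, p_student))


def identity_guided_distillation_loss(student_logits, teacher_logits,
                                      identity_ids,
                                      student_temp=0.1, teacher_temp=0.04):
    """Average cross-entropy over ordered pairs of distinct images (i, j)
    that share an identity: the teacher sees image i, the student image j.

    Assumption (hypothetical): identity_ids come from an upstream identity
    correlation step; temperatures are illustrative, not from the paper.
    """
    # Group image indices by their identity label.
    by_identity = {}
    for index, pid in enumerate(identity_ids):
        by_identity.setdefault(pid, []).append(index)

    total, pairs = 0.0, 0
    for indices in by_identity.values():
        for i, j in permutations(indices, 2):
            p_teacher = softmax(teacher_logits[i], teacher_temp)
            p_student = softmax(student_logits[j], student_temp)
            total += cross_entropy(p_teacher, p_student)
            pairs += 1
    # No same-identity pairs means there is nothing to distill.
    return total / pairs if pairs else 0.0


if __name__ == "__main__":
    # Toy logits: images 0 and 1 agree with each other, as do 2 and 3.
    logits = [[5.0, 0.0, 0.0], [5.0, 0.0, 0.0],
              [0.0, 5.0, 0.0], [0.0, 5.0, 0.0]]
    clean = identity_guided_distillation_loss(logits, logits, [0, 0, 1, 1])
    noisy = identity_guided_distillation_loss(logits, logits, [0, 1, 0, 1])
    print("loss with clean identity labels:", clean)
    print("loss with noisy identity labels:", noisy)
```

In this toy setup the loss is small when the identity labels group images whose outputs already agree, and large when the labels are wrong. That is one way to see why reducing noise in the identity correlations matters before a loss of this kind is applied.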