Cross-modality neighbor constraints based unbalanced multi-view text–image re-identification

Yongxi Li, Wenzhong Tang, Ke Zhang, Xi Zhu, Haoming Wang, Shuai Wang
DOI: https://doi.org/10.1007/s00530-024-01530-6
IF: 3.9
2024-11-20
Multimedia Systems
Abstract: Text-to-image Person Re-Identification (TIReID) is an identity retrieval task between visual and textual modalities. Previous research focuses on learning rich and diverse modality-shared semantic features and achieves excellent performance. However, these methods still have several notable limitations: (1) Noisy influence: because cross-modality annotation is difficult and the quality of crowdsourced labels is uncertain, incorrect text–image pairs inevitably introduce noisy labels. (2) Sample imbalance: datasets collected from real-world sources often have an unbalanced distribution of samples across categories, which results in inconsistent parameter update progress during the training phase. To address these issues, we propose a two-stage training pipeline for TIReID learning with noisy correspondence. Firstly, we employ a Noisy Correspondence Detector based on heterogeneous relation retrieval that estimates confidence weights for each sample pair. Secondly, we design a multi-view triplet loss function, which lets sample-level features interact with global class centers, addressing sample imbalance and facilitating a smoother distribution in feature space. Finally, we use the clean samples to train the model through a progressive learning process. Extensive experiments on RSTPReid, CUHK-PEDES, and ICFG-PEDES demonstrate the effectiveness of our method against state-of-the-art TIReID methods.
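The abstract gives no formula for the loss, so the following is only a minimal sketch of how a confidence-weighted triplet loss against class centers could look. It is not the authors' implementation: the function name `center_triplet_loss`, the Euclidean distance, the mean-based centers, the nearest-other-center negative, and the `margin` value are all assumptions made for illustration. The intuition it tries to capture is that each identity contributes one center regardless of how many samples it has, and that pairs with low confidence contribute less to the loss.

```python
import numpy as np

def center_triplet_loss(features, labels, weights, margin=0.3):
    """Hypothetical confidence-weighted triplet loss against class centers.

    features: (n, d) sample embeddings
    labels:   (n,) identity label per sample
    weights:  (n,) confidence weight per sample (assumed to lie in [0, 1])
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    weights = np.asarray(weights, dtype=float)

    classes = np.unique(labels)
    if classes.size < 2:
        raise ValueError("need at least two identities to form a negative")

    # One center per identity (assumption: plain mean of its embeddings).
    centers = np.stack([features[labels == c].mean(axis=0) for c in classes])

    # Distance from every sample to every center, shape (n, n_classes).
    dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)

    # own[i, k] is True when center k belongs to sample i's identity.
    own = labels[:, None] == classes[None, :]
    d_pos = dists[own]                                # distance to own center
    d_neg = np.where(own, np.inf, dists).min(axis=1)  # nearest other center

    # Hinge on the margin, then average with the confidence weights.
    per_sample = np.maximum(0.0, d_pos - d_neg + margin)
    total = weights.sum()
    if total == 0:
        return 0.0
    return float((weights * per_sample).sum() / total)
```

With well-separated identities the hinge is inactive and the loss is zero; setting a sample's weight to zero removes its contribution entirely, which is one simple way the detector's confidence estimates could be used.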
computer science, information systems, theory & methods