Image Re-Identification: Where Self-supervision Meets Vision-Language Learning

Bin Wang, Yuying Liang, Lei Cai, Huakun Huang, Huanqiang Zeng
2024-07-30
Abstract: Recently, large-scale vision-language pre-trained models like CLIP have shown impressive performance in image re-identification (ReID). In this work, we explore whether self-supervision can aid in the use of CLIP for image ReID tasks. Specifically, we propose SVLL-ReID, the first attempt to integrate self-supervision and pre-trained CLIP via two training stages to facilitate image ReID. We observe that: 1) incorporating language self-supervision in the first training stage can make the learnable text prompts more distinguishable, and 2) incorporating vision self-supervision in the second training stage can make the image features learned by the image encoder more discriminative. These observations imply that: 1) the text prompt learning in the first stage can benefit from language self-supervision, and 2) the image feature learning in the second stage can benefit from vision self-supervision. These benefits jointly drive the performance gain of the proposed SVLL-ReID. By conducting experiments on six image ReID benchmark datasets without any concrete text labels, we find that the proposed SVLL-ReID achieves the best overall performance compared with state-of-the-art methods. Code will be publicly available at https://github.com/BinWangGzhu/SVLL-ReID.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the difficulty of image re-identification (ReID) when images captured by different cameras contain interfering factors such as cluttered backgrounds, lighting changes, and object occlusions. It proposes SVLL-ReID, which the authors describe as the first attempt to combine self-supervised learning with a large-scale vision-language pre-trained model (CLIP) for image ReID. SVLL-ReID does this in two training stages:

1. **First Stage**: Introduce language self-supervision to optimize the learnable text prompts, making them more distinguishable.
2. **Second Stage**: Introduce vision self-supervision to optimize the features learned by the image encoder, making them more discriminative.

Experiments on six image ReID benchmark datasets, run without any concrete text labels, show that SVLL-ReID achieves the best overall performance compared with existing state-of-the-art methods. This indicates that combining self-supervision with vision-language learning helps improve image ReID performance.