Image Re-Identification: Where Self-supervision Meets Vision-Language Learning

Bin Wang, Yuying Liang, Lei Cai, Huakun Huang, Huanqiang Zeng

2024-07-30

Abstract: Recently, large-scale vision-language pre-trained models like CLIP have shown impressive performance in image re-identification (ReID). In this work, we explore whether self-supervision can aid in the use of CLIP for image ReID tasks. Specifically, we propose SVLL-ReID, the first attempt to integrate self-supervision and pre-trained CLIP via two training stages to facilitate image ReID. We observe that: 1) incorporating language self-supervision in the first training stage can make the learnable text prompts more distinguishable, and 2) incorporating vision self-supervision in the second training stage can make the image features learned by the image encoder more discriminative. These observations imply that: 1) the text prompt learning in the first stage can benefit from the language self-supervision, and 2) the image feature learning in the second stage can benefit from the vision self-supervision. These benefits jointly facilitate the performance gain of the proposed SVLL-ReID. By conducting experiments on six image ReID benchmark datasets without any concrete text labels, we find that the proposed SVLL-ReID achieves the best overall performance compared with state-of-the-art methods. Code will be publicly available at https://github.com/BinWangGzhu/SVLL-ReID.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses image re-identification (ReID), which is difficult because images captured by different cameras contain interfering factors such as cluttered backgrounds, lighting changes, and object occlusions. It proposes SVLL-ReID, which the authors describe as the first attempt to combine self-supervised learning with a large-scale vision-language pre-trained model (CLIP) for image ReID. SVLL-ReID trains in two stages:

1. **First Stage**: Language self-supervision is applied while the learnable text prompts are optimized, making the prompts more distinguishable.
2. **Second Stage**: Vision self-supervision is applied while the image encoder is trained, making the image features it learns more discriminative.

Experiments on six image ReID benchmark datasets, run without any concrete text labels, show that SVLL-ReID achieves the best overall performance among the compared state-of-the-art methods. This indicates that combining self-supervision with vision-language learning helps image ReID.
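The summary above does not say which self-supervised objective each stage uses. As a rough illustration only, the sketch below assumes a generic symmetric InfoNCE-style contrastive loss between two views of the same items, which is one common form of self-supervision; it is not taken from the paper. The function names, the temperature value, and the random toy features are all hypothetical. In a two-stage setup like the one described, the two views could be two perturbed versions of the text prompt features (first stage) or two augmentations of the same image passed through the image encoder (second stage).

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def info_nce(view_a, view_b, temperature=0.1):
    """Symmetric InfoNCE loss (assumed objective, not the paper's).

    Row i of view_a is treated as the positive for row i of view_b;
    every other row in the batch acts as a negative.
    """
    a = l2_normalize(view_a)
    b = l2_normalize(view_b)
    logits = a @ b.T / temperature  # (N, N) similarity matrix

    def diag_cross_entropy(lg):
        # Numerically stable log-softmax over each row, then pick the diagonal.
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the a->b and b->a directions.
    return 0.5 * (diag_cross_entropy(logits) + diag_cross_entropy(logits.T))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(8, 16))                      # toy stand-in for encoder features
    view = feats + 0.01 * rng.normal(size=feats.shape)    # slightly perturbed second view
    print("loss for matching views:  ", info_nce(feats, view))
    print("loss for mismatched views:", info_nce(feats, feats[::-1].copy()))
```

The loss is low when each item's two views agree and high when they are mismatched, which is the property that pushes features (text prompts or image embeddings) of different identities apart.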

CLIP-ReID: Exploiting Vision-Language Model for Image Re-identification without Concrete Text Labels

Exploring Part-Informed Visual-Language Learning for Person Re-Identification

Unveiling the Power of CLIP in Unsupervised Visible-Infrared Person Re-Identification

CLIP-SCGI: Synthesized Caption-Guided Inversion for Person Re-Identification

SLIP: Self-supervision meets Language-Image Pre-training

Prototypical Contrastive Learning-based CLIP Fine-tuning for Object Re-identification

VLUReID: Exploiting Vision-Language Knowledge for Unsupervised Person Re-Identification

ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation

ViLReF: An Expert Knowledge Enabled Vision-Language Retinal Foundation Model

ResCLIP: Residual Attention for Training-free Dense Vision-language Inference

SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference

CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction

Boosting Generalization Performance in Person Re-identification

Vision-by-Language for Training-Free Compositional Image Retrieval

MLLMReID: Multimodal Large Language Model-based Person Re-identification

Unsupervised Visible-Infrared Person ReID by Collaborative Learning with Neighbor-Guided Label Refinement

Replacement as a Self-supervision for Fine-grained Vision-language Pre-training

Enhancing Large Vision Language Models with Self-Training on Image Comprehension

When Large Vision-Language Models Meet Person Re-Identification