Abstract: Person re-identification (ReID) has recently benefited from large pretrained vision-language models such as Contrastive Language-Image Pre-Training (CLIP). However, the absence of concrete descriptions necessitates the use of implicit text embeddings, which demand complicated and inefficient training strategies. To address this issue, we first propose a straightforward solution: leveraging existing image captioning models to generate pseudo captions for person images, thereby boosting person re-identification with large vision-language models. Using models like the Large Language and Vision Assistant (LLAVA), we generate high-quality captions based on fixed templates that capture key semantic attributes such as gender, clothing, and age. By augmenting ReID training sets from uni-modality (image) to bi-modality (image and text), we introduce CLIP-SCGI, a simple yet effective framework that leverages synthesized captions to guide the learning of discriminative and robust representations. Built on CLIP, CLIP-SCGI fuses image and text embeddings through two modules to enhance the training process. To address quality issues in generated captions, we introduce a caption-guided inversion module that captures semantic attributes from images by converting relevant visual information into pseudo-word tokens based on the descriptions. This approach helps the model better capture key information and focus on relevant regions. The extracted features are then utilized in a cross-modal fusion module, guiding the model to focus on regions semantically consistent with the caption, thereby facilitating the optimization of the visual encoder to extract discriminative and robust representations. Extensive experiments on four popular ReID benchmarks demonstrate that CLIP-SCGI outperforms the state of the art by a significant margin.
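The abstract does not spell out how the caption-guided inversion module turns visual information into a pseudo-word token. One plausible reading, offered here only as a minimal sketch, is attention pooling: a caption embedding acts as the query, image patch features act as keys and values, and the weighted sum of patches becomes the pseudo-word token. The function name `pseudo_word_token`, the toy 2-D vectors, and the use of plain scaled dot-product attention are all assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def pseudo_word_token(caption_query, patch_features):
    """Attention-pool patch features using a caption embedding as the query.

    Assumption: this is generic scaled dot-product attention standing in for
    the paper's inversion module, whose architecture the abstract omits.
    Returns the pooled token and the attention weight given to each patch.
    """
    scale = math.sqrt(len(caption_query))
    weights = softmax([dot(caption_query, p) / scale for p in patch_features])
    dim = len(patch_features[0])
    token = [
        sum(w * p[i] for w, p in zip(weights, patch_features))
        for i in range(dim)
    ]
    return token, weights


# Toy example: the first patch aligns with the caption query, the others do not.
caption_query = [1.0, 0.0]
patch_features = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
token, weights = pseudo_word_token(caption_query, patch_features)
```

In this toy case the first patch receives most of the attention weight, so the pooled token is dominated by the region that agrees with the caption. This mirrors the abstract's stated goal of focusing on caption-consistent regions.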

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper attempts to address several key issues in the Person Re-Identification (ReID) task:

1. **Limitations of Single-Modality Feature Learning**: Existing ReID methods mainly rely on the image modality and lack the semantic information needed to extract comprehensive features, which limits overall performance.
2. **Small Dataset Size**: Current ReID datasets are relatively small, so models trained on them overfit easily and perform poorly in real, diverse scenarios.
3. **Quality Issues of Generated Descriptions**: Although large pre-trained vision-language models (such as LLaVA) can generate high-quality descriptions, the generated descriptions may still contain errors, which can degrade training.

To address these issues, the authors propose a new framework called CLIP-SCGI (Synthesized Caption-Guided Inversion for Person Re-Identification), which improves ReID performance by training with synthesized descriptions. Specifically, the framework uses a pre-trained image captioning model to generate descriptions of person images and uses these descriptions to guide the model toward more discriminative and robust representations.

### Main Contributions

1. **Proposed a Simple and Effective Representation Learning Scheme**: This scheme uses synthesized descriptions to guide learning for the ReID task, enabling richer and more flexible representations than existing identity-based methods.
2. **Designed a Simple Framework**: Built on the CLIP model, it leverages the correlation between visual and semantic features, enhancing the model's ability to extract discriminative features. Experimental results demonstrate the effectiveness of this framework.
3. **Introduced a Caption-Guided Inversion Module**: This module converts visual information into pseudo-word tokens guided by the synthesized descriptions, helping the model capture key information and improving the accuracy of feature learning.

### Experimental Results

The authors conducted extensive experiments on four popular ReID benchmark datasets (Market-1501, MSMT17, DukeMTMC-reID, and Occluded-Duke), and the results show that CLIP-SCGI significantly outperforms existing state-of-the-art methods. For example, on the MSMT17 dataset, the framework achieved 88.2% mAP, setting a new state-of-the-art result.

### Conclusion

This paper improves ReID performance by introducing synthesized descriptions, addressing the limitations of existing methods in feature learning and dataset size. The proposed CLIP-SCGI framework is simple and effective, and it achieves significant performance improvements on multiple benchmark datasets.
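For reference, the mAP figure quoted in the experimental results is a mean of per-query average precision values. The sketch below shows that computation on made-up rankings. The function names and the toy data are assumptions for illustration, and benchmark-specific filtering (for example, the handling of gallery images taken by the query's own camera) is deliberately left out.

```python
def average_precision(ranked_matches):
    """Average precision for one query.

    ranked_matches[i] is True when the gallery image at rank i + 1
    shares the query's identity.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank  # precision at this correct hit
    return precision_sum / hits if hits else 0.0


def mean_average_precision(all_ranked_matches):
    """Mean of the per-query average precision values."""
    aps = [average_precision(r) for r in all_ranked_matches]
    return sum(aps) / len(aps)


# Hypothetical rankings for two queries (not data from the paper).
queries = [
    [True, False, True],  # AP = (1/1 + 2/3) / 2
    [False, True],        # AP = 1/2
]
score = mean_average_precision(queries)
```

With these two toy queries the per-query values are 5/6 and 1/2, so `score` is 2/3.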

CLIP-SCGI: Synthesized Caption-Guided Inversion for Person Re-Identification
