Abstract: Person re-identification is a research area with significant real-world applications. Despite recent progress, existing methods face challenges in robust re-identification in the wild, e.g., by focusing only on a particular modality and on unreliable patterns such as clothing. A generalized method is highly desired but remains elusive due to issues such as the trade-off between spatial and temporal resolution and imperfect feature extraction. We propose VILLS (Video-Image Learning to Learn Semantics), a self-supervised method that jointly learns spatial and temporal features from images and videos. VILLS first designs a local semantic extraction module that adaptively extracts semantically consistent and robust spatial features. Then, VILLS designs a unified feature learning and adaptation module to represent image and video modalities in a consistent feature space. By leveraging self-supervised, large-scale pre-training, VILLS establishes a new state of the art that significantly outperforms existing image- and video-based methods.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper aims to address several key challenges in the task of Person Re-Identification (ReID):

1. **Modality Differences**: Existing methods usually focus on a single modality in images or videos, ignoring the complementary information between modalities. For example, image-based methods excel at extracting high-resolution static features but lack temporal information; video-based methods can capture dynamic features but sacrifice spatial resolution.
2. **Lack of Robustness**: Current methods perform poorly in complex scenarios, such as changes in lighting, pose, and viewpoint. The performance significantly drops, especially when dealing with individuals with similar appearances or the same person in different clothing.
3. **Inconsistent Feature Extraction**: Current methods struggle to extract semantically consistent features, i.e., those individual-specific attributes that remain stable across different scenes, such as facial structure, body proportions, and movement patterns.

To address these challenges, the paper proposes VILLS (Video-Image Learning to Learn Semantics), a self-supervised method that jointly learns spatiotemporal features from images and videos, achieving more robust and generalized person re-identification.

### Main Contributions

1. **Local Semantic Extraction Module (LSE)**: This module adaptively extracts semantically consistent features by combining keypoint detectors and interactive segmentation models, improving the interpretability and effectiveness of features, especially in complex ReID scenarios.
2. **Unified Feature Learning and Adaptation Module (UFLA)**: This module aligns image and video features into a common feature space through shared encoders, selective resampling, and self-supervised learning, bridging the gap between image and video modalities and extending the capabilities of LSE to the video domain.
3. **Experimental Validation**: Extensive experiments were conducted on eight different datasets, including three different downstream ReID tasks, covering both image and video domains. The experimental results show that VILLS significantly outperforms existing image and video ReID methods on multiple key metrics, with improvements of 9.3%, 5.7%, and 6.8% on certain tasks.

### Method Overview

1. **Local Semantic Extraction Module (LSE)**:
   - Uses keypoint detectors and interactive segmentation models to adaptively extract local semantic features from images or video frames.
   - Constructs prompt vectors to extract fine-grained spatial features from specified regions.
   - Introduces a filtering mechanism to improve the accuracy and consistency of features.
2. **Unified Feature Learning and Adaptation Module (UFLA)**:
   - Converts image and video inputs into feature tokens through shared encoders.
   - Uses a selective resampling strategy to choose the most important feature tokens, ensuring the model adaptively focuses on the most relevant information.
   - Learns feature distributions from large-scale unlabeled datasets through self-supervised learning, further enhancing the model's robustness and generalization ability.
   - Introduces alignment loss to ensure consistent representations learned from different modalities of the same video source.

### Experimental Results

- **Image ReID**: VILLS achieved SOTA performance on multiple metrics in the PRCC, LTCC, and Market-1501 datasets.
- **Video ReID**: VILLS also performed excellently on the PRID2011 and MARS datasets, especially in rank-1 accuracy and mAP metrics.
- **Image-Video Hybrid ReID**: VILLS significantly outperformed other methods in rank-1 accuracy, TAR@0.01%FAR, and other metrics on the BRIAR-2, BRIAR-3, and BRIAR-4 datasets.

### Conclusion

By combining the advantages of image and video modalities, VILLS achieves more robust and generalized person re-identification. The experimental results validate the effectiveness of this method, especially in handling complex scenarios and multi-modal data.
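
To make the UFLA steps in the Method Overview more concrete, here is a minimal sketch of selecting the most important tokens and penalizing disagreement between image and video features. It is not the paper's implementation: the summary does not say how token importance is scored or what form the alignment loss takes, so the token-norm score, the mean pooling, the cosine-distance loss, and all function names below are assumptions chosen for illustration.

```python
import math

def select_top_tokens(tokens, scores, k):
    """Keep the k highest-scoring tokens, preserving their original order."""
    k = min(k, len(tokens))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])
    return [tokens[i] for i in keep]

def mean_pool(tokens):
    """Average a list of equal-length token vectors into one feature vector."""
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]

def alignment_loss(feat_a, feat_b, eps=1e-8):
    """Cosine distance (1 - cosine similarity); an assumed stand-in for the alignment loss."""
    dot = sum(a * b for a, b in zip(feat_a, feat_b))
    norm_a = math.sqrt(sum(a * a for a in feat_a))
    norm_b = math.sqrt(sum(b * b for b in feat_b))
    return 1.0 - dot / (norm_a * norm_b + eps)

def token_norm(token):
    """Hypothetical importance score: the L2 norm of the token."""
    return math.sqrt(sum(x * x for x in token))

# Toy tokens: one frame (image branch) and the clip it came from (video branch).
image_tokens = [[1.0, 0.0], [0.1, 0.1], [0.9, 0.2]]
video_tokens = [[0.8, 0.1], [0.0, 0.05], [1.1, 0.0], [0.05, 0.0]]

image_feat = mean_pool(
    select_top_tokens(image_tokens, [token_norm(t) for t in image_tokens], k=2)
)
video_feat = mean_pool(
    select_top_tokens(video_tokens, [token_norm(t) for t in video_tokens], k=2)
)
loss = alignment_loss(image_feat, video_feat)
print(f"alignment loss: {loss:.4f}")
```

In a real training loop the tokens would come from the shared encoders and the loss would be minimized jointly with the self-supervised objective; the toy lists here only show the data flow.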
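
The Experimental Results cite rank-1 accuracy and mAP without defining them. The sketch below shows one common way to compute both from a query-by-gallery distance matrix. The function names and toy data are illustrative assumptions, and the filtering steps that benchmark protocols usually add (for example, excluding gallery samples from the query's own camera) are omitted.

```python
def average_precision(ranked_matches):
    """AP for one query; ranked_matches[i] is True if the i-th ranked
    gallery item has the same identity as the query."""
    hits, precision_sum = 0, 0.0
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def evaluate(distances, query_ids, gallery_ids):
    """Return (rank-1 accuracy, mAP) from a query-by-gallery distance matrix."""
    rank1_hits, aps = 0, []
    for q, row in enumerate(distances):
        # Sort gallery indices from nearest to farthest for this query.
        order = sorted(range(len(row)), key=lambda g: row[g])
        matches = [gallery_ids[g] == query_ids[q] for g in order]
        rank1_hits += int(matches[0])
        aps.append(average_precision(matches))
    n = len(distances)
    return rank1_hits / n, sum(aps) / n

# Toy example: two queries, three gallery items.
query_ids = ["a", "b"]
gallery_ids = ["a", "b", "a"]
distances = [
    [0.1, 0.5, 0.3],  # query "a": both "a" items rank first and second
    [0.2, 0.4, 0.9],  # query "b": the "b" item ranks second
]
rank1, mean_ap = evaluate(distances, query_ids, gallery_ids)
print(rank1, mean_ap)  # prints 0.5 0.75
```

Rank-1 only asks whether the nearest gallery item is correct, while mAP rewards placing every correct match near the top, which is why the two numbers can diverge for the same ranking.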

VILLS: Video-Image Learning to Learn Semantics for Person Re-Identification

VLUReID: Exploiting Vision-Language Knowledge for Unsupervised Person Re-Identification

Embedding and Enriching Explicit Semantics for Visible-Infrared Person Re-Identification

When Large Vision-Language Models Meet Person Re-Identification

Unsupervised Visible-Infrared Person ReID by Collaborative Learning with Neighbor-Guided Label Refinement

Exploring Part-Informed Visual-Language Learning for Person Re-Identification

Self-Supervised Learning of Whole and Component-Based Semantic Representations for Person Re-Identification

CLIP-Driven Semantic Discovery Network for Visible-Infrared Person Re-Identification

Video-based Person Re-identification with Long Short-Term Representation Learning

Semantics-Aligned Representation Learning for Person Re-Identification

Joint Color-irrelevant Consistency Learning and Identity-aware Modality Adaptation for Visible-infrared Cross Modality Person Re-identification.

Stronger Heterogeneous Feature Learning for Visible-Infrared Person Re-Identification

Channel semantic mutual learning for visible-thermal person re-identification

Deep video-based person re-identification (Deep Vid-ReID): comprehensive survey

Learning Progressive Modality-shared Transformers for Effective Visible-Infrared Person Re-identification

Enhancing Person Re-Identification Performance Through In Vivo Learning

Relation-Guided Spatial Attention and Temporal Refinement for Video-Based Person Re-Identification.

Cooperative Separation of Modality Shared-Specific Features for Visible-Infrared Person Re-Identification

VP-ReID

PersonViT: Large-scale Self-supervised Vision Transformer for Person Re-Identification

Discriminative Spatial Feature Learning for Person Re-Identification