Abstract: Person re-identification is a research area with significant real-world applications. Despite recent progress, existing methods face challenges in robust re-identification in the wild, e.g., by focusing only on a particular modality and on unreliable patterns such as clothing. A generalized method is highly desired, but remains elusive to achieve due to issues such as the trade-off between spatial and temporal resolution and imperfect feature extraction. We propose VILLS (Video-Image Learning to Learn Semantics), a self-supervised method that jointly learns spatial and temporal features from images and videos. VILLS first designs a local semantic extraction module that adaptively extracts semantically consistent and robust spatial features. Then, VILLS designs a unified feature learning and adaptation module to represent image and video modalities in a consistent feature space. By leveraging self-supervised, large-scale pre-training, VILLS establishes a new state of the art that significantly outperforms existing image- and video-based methods.
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve
This paper aims to address several key challenges in the task of Person Re-Identification (ReID):
1. **Modality Differences**: Existing methods usually focus on a single modality, either images or videos, and ignore the complementary information between the two. Image-based methods excel at extracting high-resolution static features but lack temporal information; video-based methods capture dynamic features but sacrifice spatial resolution.
2. **Lack of Robustness**: Current methods perform poorly in complex scenarios involving changes in lighting, pose, and viewpoint. Performance drops significantly when different individuals look alike or when the same person appears in different clothing.
3. **Inconsistent Feature Extraction**: Current methods struggle to extract semantically consistent features, i.e., those individual-specific attributes that remain stable across different scenes, such as facial structure, body proportions, and movement patterns.
To address these challenges, the paper proposes VILLS (Video-Image Learning to Learn Semantics), a self-supervised method that jointly learns spatiotemporal features from images and videos, achieving more robust and generalized person re-identification.
### Main Contributions
1. **Local Semantic Extraction Module (LSE)**: This module adaptively extracts semantically consistent features by combining keypoint detectors and interactive segmentation models, improving the interpretability and effectiveness of features, especially in complex ReID scenarios.
2. **Unified Feature Learning and Adaptation Module (UFLA)**: This module aligns image and video features into a common feature space through shared encoders, selective resampling, and self-supervised learning, bridging the gap between image and video modalities and extending the capabilities of LSE to the video domain.
3. **Experimental Validation**: Extensive experiments were conducted on eight datasets spanning three downstream ReID tasks across the image and video domains. The results show that VILLS significantly outperforms existing image and video ReID methods on multiple key metrics, with improvements of 9.3%, 5.7%, and 6.8% on certain tasks.
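
The summary does not say how LSE turns keypoint detections into inputs for the segmentation model, so the following is only a minimal sketch of one plausible arrangement. It assumes the keypoint detector returns named points with confidence scores, groups them into body regions, and drops low-confidence points before they are used as point prompts. The `Keypoint` type, the `REGIONS` grouping, and the `min_confidence` threshold are all hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Keypoint:
    name: str
    x: float
    y: float
    confidence: float


# Hypothetical grouping of keypoint names into body regions (an assumption).
REGIONS = {
    "head": ["nose", "left_ear", "right_ear"],
    "torso": ["left_shoulder", "right_shoulder", "left_hip", "right_hip"],
    "legs": ["left_knee", "right_knee", "left_ankle", "right_ankle"],
}


def build_point_prompts(keypoints, min_confidence=0.5):
    """Group confident keypoints into per-region lists of (x, y) point prompts."""
    by_name = {kp.name: kp for kp in keypoints}
    prompts = {}
    for region, names in REGIONS.items():
        points = [
            (by_name[n].x, by_name[n].y)
            for n in names
            if n in by_name and by_name[n].confidence >= min_confidence
        ]
        if points:  # skip regions that have no reliable keypoint
            prompts[region] = points
    return prompts
```

Each region's point list would then be passed to an interactive segmentation model to obtain a mask for that body part; that call is omitted here because the summary does not name the model or its interface.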
### Method Overview
1. **Local Semantic Extraction Module (LSE)**:
- Uses keypoint detectors and interactive segmentation models to adaptively extract local semantic features from images or video frames.
- Constructs prompt vectors to extract fine-grained spatial features from specified regions.
- Introduces a filtering mechanism to improve the accuracy and consistency of features.
2. **Unified Feature Learning and Adaptation Module (UFLA)**:
- Converts image and video inputs into feature tokens through shared encoders.
- Uses a selective resampling strategy to choose the most important feature tokens, ensuring the model adaptively focuses on the most relevant information.
- Learns feature distributions from large-scale unlabeled datasets through self-supervised learning, further enhancing the model's robustness and generalization ability.
- Introduces alignment loss to ensure consistent representations learned from different modalities of the same video source.
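
Two of the UFLA steps above, selective resampling and the alignment loss, can be illustrated with a small sketch. The summary gives neither the scoring function nor the exact form of the loss, so this example assumes that importance scores are supplied externally, that resampling means keeping the top-k tokens, and that alignment is measured as one minus the cosine similarity of mean-pooled features. All function names are hypothetical, and plain Python lists stand in for the tensors a real implementation would use.

```python
import math


def select_top_k_tokens(tokens, scores, k):
    """Keep the k tokens with the highest scores, preserving their original order."""
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(top)]


def mean_pool(tokens):
    """Average a list of token vectors into a single feature vector."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def cosine_alignment_loss(a, b):
    """Return 1 - cosine similarity; close to 0 when the features point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


# Toy usage: tokens from a single frame versus a feature from its source video.
image_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
importance = [0.9, 0.1, 0.5]  # assumed to come from some learned scorer
image_feature = mean_pool(select_top_k_tokens(image_tokens, importance, k=2))
video_feature = [2.0, 1.0]
loss = cosine_alignment_loss(image_feature, video_feature)
```

Because cosine similarity ignores vector magnitude, this choice of loss penalizes only a difference in direction between the image and video features; a distance-based loss would be a reasonable alternative under the same description.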
### Experimental Results
- **Image ReID**: VILLS achieved state-of-the-art performance on multiple metrics on the PRCC, LTCC, and Market-1501 datasets.
- **Video ReID**: VILLS also achieved strong results on the PRID2011 and MARS datasets, particularly in rank-1 accuracy and mAP.
- **Image-Video Hybrid ReID**: VILLS significantly outperformed other methods in rank-1 accuracy, TAR@0.01%FAR, and other metrics on the BRIAR-2, BRIAR-3, and BRIAR-4 datasets.
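
For readers unfamiliar with the metrics named above, here is a minimal sketch of how rank-1 accuracy and TAR at a fixed FAR are commonly computed. It assumes a precomputed query-by-gallery similarity matrix and separate lists of genuine (same-identity) and impostor (different-identity) scores. It follows the usual definitions rather than the paper's evaluation code, and the function names are hypothetical.

```python
def rank1_accuracy(similarity, query_ids, gallery_ids):
    """Fraction of queries whose most similar gallery item has the same identity."""
    hits = 0
    for row, query_id in zip(similarity, query_ids):
        best = max(range(len(row)), key=lambda j: row[j])
        hits += gallery_ids[best] == query_id
    return hits / len(query_ids)


def tar_at_far(genuine, impostor, far):
    """True accept rate at the loosest threshold that keeps the false accept rate <= far."""
    allowed = int(far * len(impostor))  # impostor scores allowed above the threshold
    ranked = sorted(impostor, reverse=True)
    # Accept only scores strictly greater than the (allowed + 1)-th highest impostor score.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    return sum(score > threshold for score in genuine) / len(genuine)
```

With these definitions, TAR@0.01%FAR corresponds to `far=0.0001`, which needs at least 10,000 impostor scores before any false accept is permitted.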
### Conclusion
By combining the advantages of image and video modalities, VILLS achieves more robust and generalized person re-identification. The experimental results validate the effectiveness of this method, especially in handling complex scenarios and multi-modal data.