Exploring Stronger Transformer Representation Learning for Occluded Person Re-Identification

Zhangjian Ji, Donglin Cheng, Kai Feng
2024-10-23
Abstract: Due to complex factors (e.g., occlusion, pose variation and diverse camera perspectives), extracting stronger feature representations in person re-identification remains a challenging task. In this paper, we propose a novel transformer-based person re-identification framework that combines self-supervision and supervision, namely SSSC-TransReID. Unlike general transformer-based person re-identification models, we design a self-supervised contrastive learning branch, which can enhance the feature representation for person re-identification without negative samples or additional pre-training. To train the contrastive learning branch, we also propose a novel random rectangle mask strategy that simulates the occlusion found in real scenes, so as to enhance the feature representation under occlusion. Finally, we use a joint-training loss function to integrate the advantages of supervised learning with ID tags and self-supervised contrastive learning without negative samples, which reinforces the ability of our model to extract stronger discriminative features, especially under occlusion. Extensive experimental results on several benchmark datasets show that our proposed model consistently obtains superior Re-ID performance and outperforms state-of-the-art Re-ID methods by large margins on mean average precision (mAP) and Rank-1 accuracy.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper asks how to extract stronger feature representations for person re-identification (ReID) under complex factors such as occlusion, pose change, and diverse camera viewpoints. Specifically, the authors propose a new Transformer-based person re-identification framework that targets the challenges caused by occlusion, in particular by simulating real-world occlusion during training to strengthen the model's feature representation.

### Core Problems of the Paper

1. **Occlusion Problem**:
   - Occlusion introduces noise, which leads to matching errors.
   - Occluding objects may have features similar to human body parts, making it difficult to learn more discriminative features.
   - Changes in human pose, camera viewpoint, and human movement between frames may lead to inaccurate feature alignment.
2. **Limitations of Existing Methods**:
   - Most existing person re-identification methods assume that the entire body is visible in the camera view, so they perform poorly in heavily occluded scenes.
   - Common data augmentation strategies (such as color distortion, random horizontal flipping, and random erasing) suit only normal scenes. Models trained with them handle occluded images poorly because they see little occluded data during training.

### Proposed Solutions

To solve the above problems, the authors propose the following innovations:

1. **Random Rectangular Occlusion Strategy**:
   - A new data augmentation method, the random rectangular occlusion strategy (Random Rectangle Mask), is designed to simulate real-world occlusion so that the model learns more robust feature representations.
   - It is combined with other image augmentations such as Gaussian blur, random color jitter, and solarization to further improve the generalization ability of the model.
2. **Self-Supervised Contrastive Learning Branch**:
   - A self-supervised contrastive learning branch is built on the shared ViT structure. It requires no negative samples; instead it uses the internal structure information of the image itself for self-supervised learning, which enhances the feature learning ability of the ViT encoder.
3. **Joint Training Loss Function**:
   - A joint training loss function combines the advantages of supervised learning (with ID labels) and self-supervised contrastive learning without negative samples to improve model performance.

### Experimental Results

Extensive experiments on multiple benchmark datasets show that the proposed model significantly outperforms existing person re-identification methods in terms of mean Average Precision (mAP) and Rank-1 accuracy, especially on datasets with a high occlusion ratio.

### Summary

The main contribution of this paper is a joint training framework that combines self-supervised and supervised learning. It addresses person re-identification in occlusion scenarios and achieves excellent performance on multiple benchmark datasets.
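
The random rectangle mask and the joint training loss described under Proposed Solutions can be illustrated with a minimal, standard-library sketch. This is not the authors' implementation: the summary does not give the mask size range, the fill value, the form of the self-supervised loss, or the loss weighting, so the names `random_rectangle_mask`, `negative_free_loss`, `joint_loss` and all their parameters are hypothetical. In particular, the cosine-based loss below is a BYOL-style stand-in for "contrastive learning without negative samples", chosen here only as an assumption.

```python
import math
import random


def random_rectangle_mask(image, min_frac=0.1, max_frac=0.4, fill=0.0, rng=None):
    """Return a copy of `image` (H x W nested list) with one random rectangle
    overwritten by `fill`, imitating an occluder.

    The fraction range and the fill value are assumptions for illustration.
    """
    rng = rng or random.Random()
    height, width = len(image), len(image[0])
    # Rectangle side lengths as a random fraction of each image side.
    rect_h = max(1, int(height * rng.uniform(min_frac, max_frac)))
    rect_w = max(1, int(width * rng.uniform(min_frac, max_frac)))
    # Top-left corner chosen so the rectangle stays inside the image.
    top = rng.randint(0, height - rect_h)
    left = rng.randint(0, width - rect_w)
    masked = [row[:] for row in image]  # copy so the input is left unchanged
    for r in range(top, top + rect_h):
        for c in range(left, left + rect_w):
            masked[r][c] = fill
    return masked


def negative_free_loss(pred, target):
    """2 - 2 * cosine similarity between two feature vectors.

    Assumed BYOL-style objective: it pulls two views of the same image
    together and needs no negative samples.
    """
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    return 2.0 - 2.0 * dot / norm


def joint_loss(supervised, self_supervised, weight=1.0):
    """Weighted sum of the supervised (ID) loss and the self-supervised loss.

    The additive form and `weight` are assumptions; the summary only says
    the two objectives are trained jointly.
    """
    return supervised + weight * self_supervised


if __name__ == "__main__":
    image = [[1.0] * 6 for _ in range(8)]  # toy 8 x 6 single-channel image
    occluded = random_rectangle_mask(image, rng=random.Random(0))
    for row in occluded:
        print(row)
    print(joint_loss(1.5, negative_free_loss([1.0, 0.0], [0.0, 1.0]), weight=0.5))
```

In a real pipeline the mask would be applied to image tensors alongside the other augmentations (Gaussian blur, color jitter, solarization), and the two loss terms would be computed from the shared ViT encoder's outputs rather than from toy vectors.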