Video Person Re-identification Based on Transformer-CNN Model

Liang Zhao, Qiongfang Yu, Yi Yang
DOI: https://doi.org/10.1109/aiam57466.2022.00091
2022-01-01
Abstract: To address pose variation, complex backgrounds, and frequent occlusion in video person re-identification, this paper proposes ResTNet, a network model that combines a convolutional neural network with a Transformer. In ResTNet, a ResNet50 network extracts local features, and the output of its middle layer is fed to the Transformer as prior knowledge. In the Transformer branch, the feature map size is progressively reduced, which expands the receptive field so that the relationships among local features can be fully explored and global pedestrian features generated. The shifted window method is used to reduce the model's computation. During training, cross-entropy loss and triplet loss are used to optimize the two branches, respectively. On the large-scale MARS dataset, Rank-1 and mAP reach 86.8% and 80.3%, respectively, which are 3.8% and 3.3% higher than the benchmark. Beyond showing that the Transformer model can be applied successfully to video person re-identification, extensive experiments on several large datasets show that the proposed ResTNet improves both the robustness and the accuracy of person re-identification.
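The two training losses named in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the margin of 0.3, the unweighted sum of the two terms, the Euclidean distance, and the toy inputs are all assumptions made for illustration, and the abstract does not specify which branch receives which loss.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample: -log softmax(logits)[label]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin.

    The margin value is an assumed default, not taken from the paper.
    """
    gap = euclidean(anchor, positive) - euclidean(anchor, negative) + margin
    return max(0.0, gap)


# Hypothetical toy inputs: identity logits and 2-D embeddings.
logits = [2.0, 0.5, -1.0]   # scores over three identities
label = 0                   # ground-truth identity index
anchor = [0.0, 0.0]
positive = [0.3, 0.4]       # same identity, distance 0.5 from the anchor
negative = [3.0, 4.0]       # different identity, distance 5.0 from the anchor

id_loss = cross_entropy(logits, label)
metric_loss = triplet_loss(anchor, positive, negative)
total_loss = id_loss + metric_loss  # unweighted sum is an assumption
```

In this toy case the negative sample is already much farther from the anchor than the positive one, so the triplet term is zero and only the classification term contributes to the total.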