Convolutional and Transformer Fusion Network Based on Cross-Attention for Occluded Person Re-identification

Xing Hong, Langwen Zhang, Hongzhen Cai, Wei Xie
DOI: https://doi.org/10.1109/ccdc62350.2024.10588218
2024-01-01
Abstract: Occluded person re-identification addresses cross-camera retrieval of pedestrians in occluded scenes, where insufficient feature representation leads to low retrieval accuracy. Most existing methods are built on Convolutional Neural Networks (CNNs) and differ in their network structures and modules, but none of them avoids the difficulty of modeling long-range dependencies that stems from the limited global receptive field of CNNs. Following the successful application of transformers to vision (ViT), some pure transformer schemes have also been proposed for occluded person re-identification, but they remain deficient in translation invariance. To address both limitations, this paper proposes a Convolutional and Transformer Fusion Network (CTF-Net) that combines the advantages of the two types of methods. CTF-Net uses a cross-attention block to interactively fuse semantic information between CNN features and ViT sequences, yielding pedestrian features that are more robust and representative under occlusion interference. Experimental results show that the method not only outperforms existing state-of-the-art methods on occluded person re-identification datasets but also performs strongly on general person re-identification datasets, demonstrating robustness against occlusion interference, good generalization, and scalability across application scenarios.
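The abstract does not specify how the cross-attention block is built, so the following is only a minimal sketch of the general idea: tokens from one branch act as queries over tokens from the other branch, in both directions. It assumes single-head scaled dot-product attention with no learned projections, a shared feature dimension for both branches, and a residual sum as the fusion step. The function names `cross_attention` and `fuse` and the toy token values are hypothetical and are not taken from the paper. Pure Python is used only to keep the example self-contained.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Single-head scaled dot-product cross-attention (no learned projections).

    queries: list of n vectors of dimension d (tokens from one branch).
    context: list of m vectors of dimension d (tokens from the other branch).
    Returns n vectors, each a weighted average of the context vectors.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    attended = []
    for q in queries:
        # Similarity of this query to every context token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in context]
        weights = softmax(scores)
        # Weighted average of context tokens, one output coordinate at a time.
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, context)) for j in range(d)]
        )
    return attended


def fuse(cnn_tokens, vit_tokens):
    """Bidirectional fusion: each branch attends to the other.

    The residual sum used here is an illustrative assumption, not the
    paper's documented design.
    """
    cnn_attended = cross_attention(cnn_tokens, vit_tokens)
    vit_attended = cross_attention(vit_tokens, cnn_tokens)
    cnn_fused = [[a + b for a, b in zip(x, y)] for x, y in zip(cnn_tokens, cnn_attended)]
    vit_fused = [[a + b for a, b in zip(x, y)] for x, y in zip(vit_tokens, vit_attended)]
    return cnn_fused, vit_fused


# Toy inputs: 2 tokens standing in for a flattened CNN feature map and
# 3 tokens standing in for a ViT patch sequence, both of dimension 4.
cnn_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
vit_tokens = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
cnn_fused, vit_fused = fuse(cnn_tokens, vit_tokens)
```

Each branch keeps its own token count after fusion, so in this sketch the CNN side could still be reshaped back to a spatial map and the ViT side could continue through further transformer layers. A practical version would use a tensor library, learned query, key, and value projections, and a projection to align the two feature dimensions.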