Dynamic Token Selection for Aerial-Ground Person Re-Identification

Yuhai Wang
2024-11-30
Abstract: We propose a View-Decoupled Transformer (VDT) framework to address viewpoint discrepancies in person re-identification (ReID), particularly between aerial and ground views. VDT decouples view-specific and view-independent features by leveraging meta and view tokens, processed through self-attention and subtractive separation. Additionally, we introduce a Visual Token Selector (VTS) module that dynamically selects the most informative tokens, reducing redundancy and enhancing efficiency. Our approach significantly improves retrieval performance on the AGPReID dataset, while maintaining computational efficiency similar to baseline models.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses cross-view (especially air-to-ground) person re-identification (ReID), where background interference and irrelevant regions cause traditional methods to lose accuracy and incur high computational cost. Specifically:

1. **Background Interference and Irrelevant Information**: Traditional ReID methods usually process the entire image globally and are easily affected by background information and irrelevant regions, especially when the target is partially occluded or set against a complex background.
2. **High Computational Cost**: These methods process the entire image instead of focusing only on relevant regions, which results in a relatively high computational cost.

To solve these problems, the paper proposes dynamically selecting the key tokens related to the target. This reduces computational overhead and lets the model focus on important regions rather than processing the entire image. The method aims to improve the robustness and efficiency of ReID, especially in diverse environments and challenging conditions.

### Specific Problem Summary:

- **Background Interference**: Traditional methods perform poorly when dealing with complex backgrounds and partial occlusions.
- **Low Computational Efficiency**: Processing the entire image wastes computational resources.
- **Difficult Cross-view Matching**: Matching between air-to-ground views is especially complex and challenging.

### Solutions:

- **Dynamic Token Selection**: Selecting the tokens most relevant to the target reduces redundant computation and increases the model's attention to key regions.
- **View-Decoupled Transformer (VDT)**: Meta tokens and view tokens are introduced to separate global features from view-related features, further enhancing the robustness and efficiency of the model.
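The dynamic token selection idea can be illustrated with a minimal sketch. The summary does not say how the Visual Token Selector scores tokens, so everything here is an assumption for illustration: the function name `select_top_k_tokens`, the use of a dot product against a single query token (for example a class or meta token) as the relevance score, and the fixed budget `k` are all hypothetical.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def select_top_k_tokens(
    patch_tokens: List[List[float]],
    query_token: Sequence[float],
    k: int,
) -> Tuple[List[int], List[List[float]]]:
    """Keep the k patch tokens that score highest against query_token.

    Assumption: relevance is the dot product with a query token.
    The actual scoring rule of the VTS module is not given in the text.
    """
    if not 0 < k <= len(patch_tokens):
        raise ValueError("k must be between 1 and the number of tokens")

    # Score every patch token against the query token.
    scores = [dot(token, query_token) for token in patch_tokens]

    # Indices of the k highest-scoring tokens.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

    # Restore the original spatial order of the kept tokens.
    kept = sorted(top)
    return kept, [patch_tokens[i] for i in kept]


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.5, 0.5]]
    indices, selected = select_top_k_tokens(tokens, [1.0, 0.0], k=2)
    print(indices, selected)
```

Because only `k` tokens are passed on, later attention layers operate on a shorter sequence, which is where the reduction in redundant computation would come from under these assumptions.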
Through these innovations, the paper aims to improve performance in cross-view person re-identification, especially in the diverse environments and complex conditions encountered in practical applications.
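The "subtractive separation" mentioned in the abstract can be pictured with a small sketch. The text does not specify the operation, so this assumes the simplest reading: after self-attention, the view token is subtracted element-wise from the meta token, leaving a view-independent feature. The function name `decouple_view` and the example values are hypothetical.

```python
from typing import List, Sequence


def decouple_view(
    meta_token: Sequence[float],
    view_token: Sequence[float],
) -> List[float]:
    """Return a view-independent feature as meta_token minus view_token.

    Assumption: the separation is a plain element-wise subtraction.
    The text only states that meta and view tokens are separated
    subtractively, not how or at which layers.
    """
    if len(meta_token) != len(view_token):
        raise ValueError("meta_token and view_token must have the same length")
    return [m - v for m, v in zip(meta_token, view_token)]


if __name__ == "__main__":
    meta = [1.0, 2.0, 3.0]   # global feature, still mixed with view cues
    view = [0.5, 0.5, 1.0]   # view-specific component (aerial or ground)
    print(decouple_view(meta, view))
```

Under this reading, the view token absorbs the aerial-versus-ground cues, so the residual is what would be compared across cameras at retrieval time.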