Abstract: We propose a View-Decoupled Transformer (VDT) framework to address viewpoint discrepancies in person re-identification (ReID), particularly between aerial and ground views. VDT decouples view-specific and view-independent features by introducing meta and view tokens, which are processed through self-attention and then separated by subtraction. Additionally, we introduce a Visual Token Selector (VTS) module that dynamically selects the most informative tokens, reducing redundancy and improving efficiency. Our approach significantly improves retrieval performance on the AGPReID dataset while keeping computational cost similar to that of baseline models.

What problem does this paper attempt to address?

The paper targets cross-view person re-identification (ReID), especially aerial-to-ground matching, where background interference and irrelevant image regions degrade the accuracy of traditional methods and raise their computational cost. Specifically:

1. **Background Interference and Irrelevant Information**: Traditional ReID methods usually process the entire image globally, so they are easily affected by background clutter and irrelevant regions, especially when the target person is partially occluded or set against a complex background.
2. **High Computational Cost**: Because these methods process the entire image rather than only the relevant regions, they incur a relatively high computational cost.

To address these problems, the paper dynamically selects the key tokens related to the target person. This reduces computational overhead and lets the model focus on important regions instead of processing the entire image. The aim is to improve the robustness and efficiency of ReID, especially in diverse environments and under challenging conditions.

### Specific Problem Summary:

- **Background Interference**: Traditional methods perform poorly with complex backgrounds and partial occlusions.
- **Low Computational Efficiency**: Processing the entire image wastes computational resources.
- **Difficult Cross-view Matching**: Matching between aerial and ground views is especially complex and challenging.

### Solutions:

- **Dynamic Token Selection**: Selecting the tokens most relevant to the target person reduces redundant computation and concentrates the model's attention on key regions.
- **View-Decoupled Transformer (VDT)**: Meta tokens and view tokens are introduced to separate view-independent (global) features from view-specific features, further improving the robustness and efficiency of the model.

Through these innovations, the paper aims to improve performance in cross-view person re-identification, especially in the diverse environments and complex conditions encountered in practical applications.
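The two ideas summarized above can be illustrated with a minimal sketch in plain Python. It is not the paper's implementation: the scoring rule (scaled dot-product attention from a single query token), the fixed budget `k`, and the function names `select_top_k_tokens` and `subtract_view` are all assumptions made for illustration. The actual VTS and VDT modules are learned components whose details are not given here.

```python
import math
from typing import List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a 1-D sequence."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_top_k_tokens(query: Vector, patch_tokens: List[Vector], k: int) -> List[int]:
    """Return the indices of the k patch tokens the query attends to most.

    Assumption: relevance is scaled dot-product attention from one query
    token (for example a class or meta token); the paper may score tokens
    differently.
    """
    if not 0 < k <= len(patch_tokens):
        raise ValueError("k must be between 1 and the number of patch tokens")
    scale = math.sqrt(len(query))
    scores = [sum(q * t for q, t in zip(query, tok)) / scale for tok in patch_tokens]
    weights = softmax(scores)
    ranked = sorted(range(len(patch_tokens)), key=lambda i: weights[i], reverse=True)
    return sorted(ranked[:k])  # restore the original spatial order


def subtract_view(meta_token: Vector, view_token: Vector) -> Vector:
    """Hypothetical subtractive separation: remove the view-specific
    component from the meta token, element by element."""
    return [m - v for m, v in zip(meta_token, view_token)]


# Toy example: four 2-D patch tokens, keep the two most relevant ones.
query = [1.0, 0.0]
patches = [[0.1, 0.9], [2.0, 0.0], [0.0, 0.0], [1.5, 0.5]]
kept = select_top_k_tokens(query, patches, k=2)
print(kept)                                      # [1, 3]
print(subtract_view([1.0, 2.0], [0.25, 0.5]))    # [0.75, 1.5]
```

Only the kept tokens would be passed to later transformer layers, which is where the claimed savings come from: self-attention cost grows quadratically with the number of tokens, so halving the token count roughly quarters the attention cost of each subsequent layer.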

Dynamic Token Selection for Aerial-Ground Person Re-Identification

Person Re-identification Based on Transform Algorithm

View-decoupled Transformer for Person Re-identification under Aerial-ground Camera Network

RETRACTED CHAPTER: Person Re-identification Based on Transform Algorithm

A Video Is Worth Three Views: Trigeminal Transformers for Video-Based Person Re-Identification

Parameter instance learning with enhanced vision transformers for aerial person re‐identification

Occluded person re-identification based on parallel triplet augmentation and parameter-free token spatial attention

EdgeVPR: Transformer-Based Real-Time Video Person Re-Identification at the Edge

Cross-Modality Spatial-Temporal Transformer for Video-Based Visible-Infrared Person Re-Identification

Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification

Spatial-Channel Enhanced Transformer for Visible-Infrared Person Re-Identification

Transformer-Based Feature Compensation Network for Aerial Photography Person and Ground Object Recognition

Cross-View Multi-Scale Re-Identification Network in the Perspective of Ground Rotorcraft Unmanned Aerial Vehicle

Efficient Video Transformers with Spatial-Temporal Token Selection

Boosting Person Re-Identification with Viewpoint Contrastive Learning and Adversarial Training

Dual-stream Transformer with Distribution Alignment for Visible-Infrared Person Re-Identification

Other Tokens Matter: Exploring Global and Local Features of Vision Transformers for Object Re-Identification

Generalizable Person Re-Identification via Viewpoint Alignment and Fusion

Adaptive Aspect Ratios with Patch-Mixup-ViT-based Vehicle ReID

DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification

Enhanced visible–infrared person re-identification based on cross-attention multiscale residual vision transformer