Abstract: Vision Transformers (ViTs) have shown exceptional performance in vehicle re-identification (ReID) tasks. However, non-square aspect ratios of image or video inputs can negatively impact re-identification accuracy. To address this challenge, we propose a novel, human-perception-driven, general ViT-based ReID framework that fuses models trained on various aspect ratios. Our key contributions are threefold: (i) We analyze the impact of aspect ratios on performance using the VeRi-776 and VehicleID datasets, providing guidance for input settings based on the distribution of original image aspect ratios. (ii) We introduce a patch-wise mixup strategy during ViT patchification (guided by spatial attention scores) and implement an uneven stride for better alignment with object aspect ratios. (iii) We propose a dynamic feature fusion ReID network to enhance model robustness. Our method outperforms state-of-the-art transformer-based approaches on both datasets, with only a minimal increase in inference time per image.

What problem does this paper attempt to address?

This paper addresses the negative effect that image or video inputs with non-square aspect ratios have on recognition accuracy in vehicle re-identification (vehicle ReID). Specifically, existing methods based on Vision Transformers (ViTs) can lose accuracy when handling non-square inputs. To tackle this challenge, the authors propose a novel, human-perception-driven, general ViT framework that improves robustness and accuracy by fusing models trained on different aspect ratios.

### Main Contributions:

1. **Analysis of Aspect Ratio Impact**: The authors analyze how aspect ratio affects performance on the VeRi-776 and VehicleID datasets and give input-setting guidelines based on the distribution of original image aspect ratios.
2. **Introduction of Patch-level Mixing Strategy**: A patch-level mixing strategy guided by spatial attention scores is applied during ViT patchification, and non-uniform strides are used to better align with object aspect ratios.
3. **Dynamic Feature Fusion Network**: A dynamic feature fusion ReID network is proposed to enhance the model's robustness.

### Experimental Results:

- On the VeRi-776 dataset, non-square inputs (224×298) improved mAP by 4.6%, and feature fusion further improved it by 6.5%.
- On the VehicleID dataset, 384×396 inputs improved mAP by 0.6%, and feature fusion further improved mAP by 0.8% and R1 by 1.3%.
- Compared to recent ViT-based methods, the model improved mAP by 2.5% on VeRi-776 and achieved 91.0% mAP, 86.3% R1, and 97.4% R5 on the largest VehicleID test set, with R1 8.4% higher than a pure ViT.

### Conclusion:

By integrating ViT models trained on different aspect ratios, this method significantly enhances the robustness and performance of vehicle re-identification. The non-uniform stride patching preserves spatial structure, while the intra-patch mixing strategy improves generalization through random sampling. These results provide new insights into the relationship between aspect ratio, self-attention mechanisms in vision transformers, and feature generation in ReID tasks.
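
As a rough illustration of two ideas in the summary above, the following minimal Python sketch shows (a) how a non-uniform stride changes the patch grid for a non-square input and (b) one simple way to fuse embeddings from models trained at different aspect ratios. It is not the authors' implementation. The function names `patch_grid` and `fuse_features`, the patch size of 16, the stride values, the reading of 224×298 as height×width, and the softmax weighting over per-model scores are all assumptions made for this example; the paper's actual fusion rule and patch-wise mixup are not reproduced here.

```python
import math


def patch_grid(height, width, patch=16, stride_h=16, stride_w=16):
    """Rows and columns produced by a sliding-window patchifier without padding.

    Uses the standard sliding-window count: (size - patch) // stride + 1.
    """
    rows = (height - patch) // stride_h + 1
    cols = (width - patch) // stride_w + 1
    return rows, cols


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_features(features, scores):
    """Softmax-weighted sum of L2-normalized embeddings, renormalized.

    `scores` are hypothetical per-model confidence values; how the paper
    derives its dynamic weights is not specified in this summary.
    """
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]

    fused = [0.0] * len(features[0])
    for weight, feat in zip(weights, features):
        for i, x in enumerate(l2_normalize(feat)):
            fused[i] += weight * x
    return l2_normalize(fused)


# Square input with a uniform stride versus a wide input with a larger
# horizontal stride (illustrative values only).
square = patch_grid(224, 224)                         # (14, 14)
wide = patch_grid(224, 298, stride_h=16, stride_w=20)  # (14, 15)
print(square, wide)

# Two toy 2-D embeddings with equal scores receive equal weights.
fused = fuse_features([[1.0, 0.0], [0.0, 1.0]], scores=[0.0, 0.0])
print(fused)
```

With a larger horizontal stride, the wide input yields a token grid close to that of the square input, which is one plausible reading of "better alignment with object aspect ratios"; the fused embedding stays unit-length so it can be compared with cosine or Euclidean distance as in typical ReID retrieval.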

Adaptive Aspect Ratios with Patch-Mixup-ViT-based Vehicle ReID

Study on Aspect Ratio Variability toward Robustness of Vision Transformer-based Vehicle Re-identification

Multi-attribute Adaptive Aggregation Transformer for Vehicle Re-Identification.

Person Re-identification Based on Transform Algorithm

A Video Is Worth Three Views: Trigeminal Transformers for Video-Based Person Re-Identification

V2ReID: Vision-Outlooker-Based Vehicle Re-Identification

Spatial-Channel Enhanced Transformer for Visible-Infrared Person Re-Identification

Dual-stream Transformer with Distribution Alignment for Visible-Infrared Person Re-Identification

Cross-Modality Spatial-Temporal Transformer for Video-Based Visible-Infrared Person Re-Identification

Exploiting Multi-view Part-wise Correlation via an Efficient Transformer for Vehicle Re-Identification

Bi-Level Implicit Semantic Data Augmentation for Vehicle Re-Identification

Parameter instance learning with enhanced vision transformers for aerial person re‐identification

Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations

Viewpoint Alignment and Discriminative Parts Enhancement in 3D Space for Vehicle ReID

Optimizing ROI Benefits Vehicle ReID in ITS

Fine-grained Feature Alignment with Part Perspective Transformation for Vehicle ReID.

AIVR-Net: Attribute-based invariant visual representation learning for vehicle re-identification

VARID: Viewpoint-Aware Re-IDentification of Vehicle Based on Triplet Loss

PATReId: Pose Apprise Transformer Network for Vehicle Re-Identification

Multi-View Spatial Attention Embedding for Vehicle Re-Identification

PAMTRI: Pose-Aware Multi-Task Learning for Vehicle Re-Identification Using Highly Randomized Synthetic Data