Adaptive Aspect Ratios with Patch-Mixup-ViT-based Vehicle ReID

Mei Qiu, Lauren Ann Christopher, Stanley Chien, Lingxi Li
2024-11-10
Abstract: Vision Transformers (ViTs) have shown exceptional performance in vehicle re-identification (ReID) tasks. However, non-square aspect ratios of image or video inputs can negatively impact re-identification accuracy. To address this challenge, we propose a novel, human-perception-driven, and general ViT-based ReID framework that fuses models trained on various aspect ratios. Our key contributions are threefold: (i) We analyze the impact of aspect ratios on performance using the VeRi-776 and VehicleID datasets, providing guidance for input settings based on the distribution of original image aspect ratios. (ii) We introduce a patch-wise mixup strategy during ViT patchification (guided by spatial attention scores) and implement an uneven stride for better alignment with object aspect ratios. (iii) We propose a dynamic feature fusion ReID network to enhance model robustness. Our method outperforms state-of-the-art transformer-based approaches on both datasets, with only a minimal increase in inference time per image.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the negative effect that image or video inputs with non-square aspect ratios have on recognition accuracy in vehicle re-identification (Vehicle ReID). Specifically, existing methods based on Vision Transformers (ViTs) may lose performance when handling non-square inputs. To tackle this challenge, the authors propose a novel, human-perception-driven, general ViT framework that improves recognition robustness and accuracy by fusing models trained on different aspect ratios.

### Main Contributions:

1. **Analysis of Aspect Ratio Impact**: The authors analyze the impact of aspect ratio on performance using the VeRi-776 and VehicleID datasets and provide input-setting guidelines based on the distribution of original image aspect ratios.
2. **Introduction of Patch-level Mixing Strategy**: A patch-wise mixup strategy guided by spatial attention scores is introduced during ViT patchification, and non-uniform strides are used to better align with object aspect ratios.
3. **Dynamic Feature Fusion Network**: A dynamic feature fusion ReID network is proposed to enhance the model's robustness.

### Experimental Results:

- On the VeRi-776 dataset, non-square inputs (224×298) improved mAP by 4.6%, and feature fusion further improved it by 6.5%.
- On the VehicleID dataset, 384×396 inputs improved mAP by 0.6%, and feature fusion further improved mAP by 0.8% and R1 by 1.3%.
- Compared to recent ViT-based methods, the model improved mAP by 2.5% on VeRi-776. On the largest VehicleID test set it achieved 91.0% mAP, 86.3% R1, and 97.4% R5, with R1 8.4% higher than a pure ViT.

### Conclusion:

By integrating ViT models trained on different aspect ratios, this method significantly enhances the robustness and performance of vehicle re-identification. The non-uniform stride patching preserves spatial structure, while the intra-patch mixing strategy improves generalization through random sampling. These results provide new insights into the relationship between aspect ratio, self-attention mechanisms in vision transformers, and feature generation in ReID tasks.
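
To make the non-uniform stride idea concrete, the sketch below computes the patch grid a sliding-window patchifier would produce when the vertical and horizontal strides differ. It is only an illustration under stated assumptions: the function names `patch_grid` and `patch_origins` are hypothetical, the 16-pixel patch size and the stride values are chosen for the example rather than taken from the paper, no padding is applied, and 224×298 is read as height × width.

```python
from typing import List, Tuple


def patch_grid(height: int, width: int, patch: int,
               stride_h: int, stride_w: int) -> Tuple[int, int]:
    """Return (rows, cols) of patches for a sliding window without padding."""
    if height < patch or width < patch:
        raise ValueError("image must be at least one patch in each dimension")
    if stride_h <= 0 or stride_w <= 0:
        raise ValueError("strides must be positive")
    # Standard sliding-window count: floor((size - patch) / stride) + 1.
    rows = (height - patch) // stride_h + 1
    cols = (width - patch) // stride_w + 1
    return rows, cols


def patch_origins(height: int, width: int, patch: int,
                  stride_h: int, stride_w: int) -> List[Tuple[int, int]]:
    """Return the (top, left) pixel of every patch in row-major order."""
    rows, cols = patch_grid(height, width, patch, stride_h, stride_w)
    return [(r * stride_h, c * stride_w)
            for r in range(rows) for c in range(cols)]


if __name__ == "__main__":
    # Assumed values: 16x16 patches on a 224x298 (height x width) input.
    print(patch_grid(224, 298, 16, 16, 16))  # even stride: (14, 18)
    print(patch_grid(224, 298, 16, 16, 12))  # smaller horizontal stride: (14, 24)
```

With an even stride of 16 the last 10 pixel columns are not covered by any patch, whereas a horizontal stride of 12 yields overlapping patches that reach column 292. This is one plausible reason a stride matched to the input's aspect ratio could retain more spatial structure, though the stride values the authors actually use are not stated here.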