Abstract: With the continuous development of intelligent transportation systems, research on vehicle detection, tracking, and retrieval has boomed. Vehicle re-identification, which aims to determine whether a specific vehicle appears in a video stream, is a popular research direction. Previous research has shown that the transformer, which treats an image as a sequence of patches, is an efficient method in computer vision. However, an efficient vehicle re-identification method should consider image features and attribute features simultaneously. In this work, we propose a vehicle attribute transformer (VAT) for vehicle re-identification. First, we consider color and model to be the most intuitive attributes of a vehicle, since they are relatively stable and easy to distinguish; the color feature and the model feature are therefore embedded in a transformer. Second, because the shooting angle may differ from image to image, we encode the viewpoint of the vehicle image as an additional attribute. Moreover, different attributes should carry different importance. We therefore design a multi-attribute adaptive aggregation network, which compares the attributes and assigns different weights to the corresponding features. Finally, to optimize the proposed transformer network, we design a multi-sample dispersion triplet (MDT) loss. This loss considers not only the hardest samples selected by a hard mining strategy but also extra positive and negative samples. The dispersion of these multiple samples is used to dynamically adjust the loss, which guides the network to learn a better partition of the feature space. Extensive experiments on popular vehicle re-identification datasets verify that the proposed method achieves state-of-the-art performance.
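
For readers unfamiliar with the hard mining strategy that the MDT loss builds on, a minimal sketch of a standard batch-hard triplet loss is given below. It is not the MDT loss itself: the abstract does not specify the dispersion term or how the extra positive and negative samples are chosen, so neither is implemented here. The function name `batch_hard_triplet_loss`, the Euclidean distance, and the margin value are assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def batch_hard_triplet_loss(embeddings, labels, margin=0.3):
    """Standard batch-hard triplet loss (illustrative, not the MDT loss).

    For each anchor, take the farthest same-identity sample (hardest
    positive) and the closest different-identity sample (hardest
    negative), then apply a hinge with the given margin.
    The margin default is an assumed value, not taken from the paper.
    """
    losses = []
    for i, (anchor, label) in enumerate(zip(embeddings, labels)):
        pos = [
            euclidean(anchor, e)
            for j, (e, l) in enumerate(zip(embeddings, labels))
            if l == label and j != i
        ]
        neg = [
            euclidean(anchor, e)
            for e, l in zip(embeddings, labels)
            if l != label
        ]
        if not pos or not neg:
            continue  # anchor has no valid triplet in this batch
        losses.append(max(0.0, margin + max(pos) - min(neg)))
    return sum(losses) / len(losses) if losses else 0.0


# Toy batch: two vehicle identities with two 2-D features each.
features = [[0.0, 0.0], [2.0, 0.0], [1.0, 0.0], [3.0, 0.0]]
identities = [0, 0, 1, 1]
loss = batch_hard_triplet_loss(features, identities, margin=0.5)
```

In this toy batch the two identities are interleaved along one axis, so every anchor's hardest positive is farther away than its hardest negative and the loss is positive. A multi-sample variant, as described in the abstract, would additionally draw on further positives and negatives per anchor rather than only the hardest pair.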

Multi-Modal Virtual-Real Fusion based Transformer for Collaborative Perception

Multi-attribute Adaptive Aggregation Transformer for Vehicle Re-Identification.

ViT-FuseNet: MultiModal Fusion of Vision Transformer for Vehicle-Infrastructure Cooperative Perception

Collaborative Multimodal Fusion Network for Multiagent Perception

Multi-Modal Fusion Transformer for End-to-End Autonomous Driving

HM-ViT: Hetero-modal Vehicle-to-Vehicle Cooperative perception with vision transformer

Collaborative Perception Method Based on Multisensor Fusion

Transformer Based Multi-modal Fusion for Place Recognition with Self-attention Mechanism

MultiFuser: Multimodal Fusion Transformer for Enhanced Driver Action Recognition

V2VFormer++: Multi-Modal Vehicle-to-Vehicle Cooperative Perception Via Global-Local Transformer

MMFN: Multi-Modal-Fusion-Net for End-to-End Driving

V2V Based Visual Cooperative Perception for Connected Autonomous Vehicles: Far-Sight and See-Through

Multi‐future Transformer: Learning Diverse Interaction Modes for Behaviour Prediction in Autonomous Driving

MFVC: Urban Traffic Scene Video Caption Based on Multimodal Fusion

V2X-ViT: Vehicle-to-Everything Cooperative Perception with Vision Transformer

Collaborative Multimodal Vehicular Transformer Training Using Federated Learning

V2VFusion: Multimodal Fusion for Enhanced Vehicle-to-Vehicle Cooperative Perception

Spatial-Temporal Multimodal End-to-End Autonomous Driving.

IFTR: An Instance-Level Fusion Transformer for Visual Collaborative Perception

Multi-Modality Cascaded Fusion Technology for Autonomous Driving