Abstract: Vehicle recognition technology is widely applied in automatic parking, traffic restriction enforcement, and public security investigations, and it plays a significant role in the construction of intelligent transportation systems. Fine-grained vehicle recognition goes beyond conventional vehicle recognition by distinguishing more detailed sub-categories. The task is more challenging because inter-class differences are subtle while intra-class variations are large. Localization-classification subnetworks are an effective and frequently used approach to this task, but previous work has typically relied on deep CNN feature maps for object localization, whose low resolution leads to poor localization accuracy. We propose a multi-layer feature fusion localization (MFFL) method that fuses the high-resolution feature map from a shallow CNN layer with the deep feature map, exploiting the rich spatial information of the shallow feature map to localize objects more precisely. In addition, traditional methods acquire local attention information through complex model designs, which frequently results in regional redundancy or information omission. To address this, we introduce an attention module that adaptively enhances the expressiveness of global features and generates global attention features. These global attention features are then integrated with object-level features and local attention cues to achieve more comprehensive attention enhancement. Lastly, we devise a multi-branch model that applies the above object localization and attention enhancement methods in end-to-end training, so that the branches collaborate to extract fine-grained features adequately. Extensive experiments on the Stanford Cars dataset and the self-built Cars-126 dataset demonstrate the effectiveness of our method, which reaches 97.7% classification accuracy on the Stanford Cars dataset, placing it in a leading position among existing methods.
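
The fusion idea behind MFFL can be illustrated with a minimal sketch. The abstract does not specify the fusion operator, the layers used, or how the bounding box is derived, so everything below is an assumption made for illustration: feature maps are plain nested lists, channels are summed into an activation map, the deep map is upsampled by nearest-neighbor repetition, the two maps are added, and the box is the extent of all positions above the mean activation. The function names (`channel_sum`, `normalize`, `upsample_nearest`, `fuse_and_localize`) are hypothetical and are not taken from the paper.

```python
def channel_sum(fmap):
    """Collapse a C x H x W feature map (nested lists) into an H x W activation map."""
    height, width = len(fmap[0]), len(fmap[0][0])
    return [[sum(ch[y][x] for ch in fmap) for x in range(width)]
            for y in range(height)]


def normalize(amap):
    """Min-max scale an activation map to [0, 1]; a constant map becomes all zeros."""
    flat = [v for row in amap for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0
    return [[(v - lo) / span for v in row] for row in amap]


def upsample_nearest(amap, factor):
    """Nearest-neighbor upsampling by an integer factor in both dimensions."""
    out = []
    for row in amap:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def fuse_and_localize(shallow, deep):
    """Fuse shallow and deep activation maps and return a box (x0, y0, x1, y1).

    Assumption: the shallow map is an integer multiple of the deep map in size.
    Additive fusion and mean thresholding are illustrative choices only.
    """
    s = normalize(channel_sum(shallow))
    d = normalize(channel_sum(deep))
    d_up = upsample_nearest(d, len(s) // len(d))
    fused = [[a + b for a, b in zip(rs, rd)] for rs, rd in zip(s, d_up)]
    flat = [v for row in fused for v in row]
    thresh = sum(flat) / len(flat)
    hits = [(x, y) for y, row in enumerate(fused)
            for x, v in enumerate(row) if v > thresh]
    if not hits:
        return None
    xs, ys = [x for x, _ in hits], [y for _, y in hits]
    return (min(xs), min(ys), max(xs), max(ys))
```

The returned coordinates are in the resolution of the shallow map, which is the point of the fusion: the deep map contributes coarse semantic evidence, while the box is expressed on the finer grid.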

Transformer Based Multi-modal Fusion for Place Recognition with Self-attention Mechanism

Camera-LiDAR Fusion with Latent Contact for Place Recognition in Challenging Cross-Scenes

Multi-Modal Image Fusion Via Deep Laplacian Pyramid Hybrid Network

MFF-PR: Point Cloud and Image Multi-modal Feature Fusion for Place Recognition

LCPR: A Multi-Scale Attention-Based LiDAR-Camera Fusion Network for Place Recognition

Hybrid CNN-Transformer Features for Visual Place Recognition

Transformer-Based Cross-Modal Information Fusion Network for Semantic Segmentation

Large-Scale Place Recognition Based on Camera-LiDAR Fused Descriptor

Multi-Modal Fusion Transformer for End-to-End Autonomous Driving

MFST: Multi-Modal Feature Self-Adaptive Transformer for Infrared and Visible Image Fusion

Multi-Stage Residual Fusion Network for LIDAR-Camera Road Detection

HATF: Multi-Modal Feature Learning for Infrared and Visible Image Fusion via Hybrid Attention Transformer

MultiFuser: Multimodal Fusion Transformer for Enhanced Driver Action Recognition

Multi-layer feature fusion and attention enhancement for fine-grained vehicle recognition research

A Generalized Multi-Modal Fusion Detection Framework

Learnable fusion mechanisms for multimodal object detection in autonomous vehicles

Multi-Modal Fusion Based on Depth Adaptive Mechanism for 3D Object Detection

Multi-Modality Cascaded Fusion Technology for Autonomous Driving

Mutually Beneficial Transformer for Multimodal Data Fusion

Sensor Fusion by Spatial Encoding for Autonomous Driving