Abstract: Fine-grained visual categorization (FGVC) aims to recognize objects from similar subordinate categories, a task that is both challenging and practically important for accurate automatic recognition. Most FGVC approaches use attention mechanisms to mine discriminative regions but neglect the interdependencies among those regions and the holistic object structure they compose, both of which are essential for the model's ability to localize and understand discriminative information. To address these limitations, we propose the Structure Information Modeling Transformer (SIM-Trans), which incorporates object structure information into the transformer so that the learned discriminative representations contain both appearance information and structure information. Specifically, we encode the image into a sequence of patch tokens and build a strong vision transformer framework with two well-designed modules: (i) the structure information learning (SIL) module mines the spatial context relations of significant patches within the object extent with the help of the transformer's self-attention weights, and these relations are then injected into the model to import structure information; (ii) the multi-level feature boosting (MFB) module exploits the complementarity of multi-level features and contrastive learning among classes to enhance feature robustness for accurate recognition. The two proposed modules are lightweight, depend only on the attention weights that the vision transformer already provides, and can be plugged into any transformer network and trained end-to-end easily. Extensive experiments and analyses demonstrate that the proposed SIM-Trans achieves state-of-the-art performance on fine-grained visual categorization benchmarks. The code is available at https://github.com/PKU-ICST-MIPL/SIM-Trans_ACMMM2022.
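
To make the idea behind the SIL module concrete, the following is a minimal sketch of how spatial relations among significant patches could be derived from self-attention weights. It is not the authors' implementation. The abstract does not say how patches are selected or how their relations are encoded, so the mean-attention threshold, the choice of the most-attended patch as a reference, the normalized polar encoding, and the function names `significant_patches` and `relative_polar` are all assumptions made for illustration.

```python
import math


def significant_patches(cls_attention, grid_size):
    """Return (row, col) positions of patches whose attention exceeds the mean.

    cls_attention: flat list of attention weights, one per patch, in row-major
    order. Thresholding at the mean is an assumption of this sketch.
    """
    mean = sum(cls_attention) / len(cls_attention)
    return [divmod(i, grid_size) for i, w in enumerate(cls_attention) if w > mean]


def relative_polar(cls_attention, grid_size):
    """Map each significant patch to (distance, angle) relative to a reference.

    The reference is the most-attended patch (an assumption). Distance is
    divided by the grid size and the angle is rescaled from [-pi, pi] to [0, 1].
    """
    ref_index = max(range(len(cls_attention)), key=cls_attention.__getitem__)
    ref_row, ref_col = divmod(ref_index, grid_size)
    relations = {}
    for row, col in significant_patches(cls_attention, grid_size):
        d_row, d_col = row - ref_row, col - ref_col
        distance = math.hypot(d_row, d_col) / grid_size
        angle = (math.atan2(d_row, d_col) + math.pi) / (2 * math.pi)
        relations[(row, col)] = (distance, angle)
    return relations


# Toy 3x3 patch grid; the weights are made up for the example.
toy_attention = [0.0, 0.0, 0.0,
                 0.0, 0.5, 0.3,
                 0.0, 0.2, 0.0]
toy_relations = relative_polar(toy_attention, grid_size=3)
```

In this toy grid the three patches with above-average attention are kept, and each receives a distance and angle relative to the central patch. In a full model such relations would still have to be turned into features and injected back into the transformer, a step the abstract mentions but this sketch omits.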

Dual Transformer with Multi-Grained Assembly for Fine-Grained Visual Classification

Dual-Dependency Attention Transformer for Fine-Grained Visual Classification

A Transformer-Based Object Detector with Coarse-Fine Crossing Representations

TransFG: A Transformer Architecture for Fine-Grained Recognition

A free lunch from ViT: Adaptive Attention Multi-scale Fusion Transformer for Fine-grained Visual Recognition

AA-Trans: Core Attention Aggregating Transformer with Information Entropy Selector for Fine-grained Visual Classification

FET-FGVC: Feature-enhanced transformer for fine-grained visual classification

Attention-based Multi-scale ViT Fine-grained Visual Classification

A novel dual-granularity lightweight transformer for vision tasks

Part-Guided Relational Transformers for Fine-Grained Visual Recognition

Multi-level information fusion Transformer with background filter for fine-grained image recognition

Dual Aggregation Transformer for Image Super-Resolution

MFF-Trans: Multi-level Feature Fusion Transformer for Fine-Grained Visual Classification

Dual Path Transformer with Partition Attention

CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

ViT-FOD: A Vision Transformer based Fine-grained Object Discriminator

SIM-Trans: Structure Information Modeling Transformer for Fine-grained Visual Categorization

Transformer Based Multi-Grained Features for Unsupervised Person Re-Identification

DualToken-ViT: Position-aware Efficient Vision Transformer with Dual Token Fusion