Abstract: In fine-grained image recognition (FGIR), the localization and amplification of region attention is an important factor, and it has been explored extensively by convolutional neural network (CNN) based approaches. The recently developed vision transformer (ViT) has achieved promising results on computer vision tasks. Compared with CNNs, image sequentialization is a brand-new way of processing an image. However, because its patches have a fixed size, ViT is limited in its receptive field size, lacks the local attention that CNNs provide, and cannot generate multi-scale features to learn discriminative region attention. To facilitate the learning of discriminative region attention without box/part annotations, we use the strength of the attention weights to measure the importance of the patch tokens corresponding to the raw images. We propose the recurrent attention multi-scale transformer (RAMS-Trans), which uses the transformer's self-attention to recursively learn discriminative region attention in a multi-scale manner. Specifically, at the core of our approach lies region amplification guided by the dynamic patch proposal module (DPPM), which completes the integration of multi-scale image patches. The DPPM starts with the full-size image patches and iteratively scales up the region attention to generate new patches from global to local, using the intensity of the attention weights generated at each scale as an indicator. Our approach requires only the attention weights that come with ViT itself and can be easily trained end-to-end. Extensive experiments demonstrate that RAMS-Trans outperforms concurrent works as well as efficient CNN models, achieving state-of-the-art results on three benchmark datasets.
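
The abstract describes using attention-weight intensity as an indicator for choosing a region to amplify. The following is a minimal NumPy sketch of that general idea, not the authors' implementation. The function names `propose_region` and `amplify_region`, the threshold factor `alpha`, the choice of a bounding box around all above-threshold patches, and the nearest-neighbour upsampling are all assumptions made for illustration; the abstract does not specify any of them.

```python
import numpy as np


def propose_region(patch_attention, grid_size, patch_size, alpha=1.0):
    """Return a (top, left, bottom, right) pixel box around high-attention patches.

    patch_attention: one attention weight per patch token (length grid_size**2).
    alpha: hypothetical threshold factor applied to the mean attention.
    """
    attn = np.asarray(patch_attention, dtype=float).reshape(grid_size, grid_size)
    mask = attn > alpha * attn.mean()
    if not mask.any():
        # No patch stands out: fall back to the full image.
        full = grid_size * patch_size
        return (0, 0, full, full)
    rows = np.where(mask.any(axis=1))[0]
    cols = np.where(mask.any(axis=0))[0]
    top, bottom = int(rows[0]), int(rows[-1]) + 1
    left, right = int(cols[0]), int(cols[-1]) + 1
    return (top * patch_size, left * patch_size,
            bottom * patch_size, right * patch_size)


def amplify_region(image, box, out_size):
    """Crop the box from image (H, W, C) and upsample it to out_size x out_size.

    Nearest-neighbour sampling is used here only to keep the sketch dependency-free.
    """
    top, left, bottom, right = box
    crop = image[top:bottom, left:right]
    row_idx = np.arange(out_size) * crop.shape[0] // out_size
    col_idx = np.arange(out_size) * crop.shape[1] // out_size
    return crop[row_idx][:, col_idx]


# Toy example: a 4x4 patch grid over a 64x64 image, attention peaked at the centre.
attention = np.zeros(16)
attention[[5, 6, 9, 10]] = 1.0
image = np.arange(64 * 64 * 3, dtype=np.float32).reshape(64, 64, 3)

box = propose_region(attention, grid_size=4, patch_size=16)
zoomed = amplify_region(image, box, out_size=64)
```

In a recurrent multi-scale setup of the kind the abstract outlines, the amplified image would be split into patches again and fed back through the transformer, so each pass looks at a more local region than the one before it.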

GRVT: Toward Effective Grocery Recognition Via Vision Transformer

SLViT: Scale-Wise Language-Guided Vision Transformer for Referring Image Segmentation

TransFG: A Transformer Architecture for Fine-Grained Recognition

Fine-Grained Grocery Product Recognition by One-Shot Learning

A Sequence-selective Fine-grained Image Recognition Strategy Using Vision Transformer

A free lunch from ViT: Adaptive Attention Multi-scale Fusion Transformer for Fine-grained Visual Recognition

Food Recognition with Visual Transformers

LF-ViT: Reducing Spatial Redundancy in Vision Transformer for Efficient Image Recognition

CF-ViT: A General Coarse-to-Fine Method for Vision Transformer

Fine-grained image classification based on TinyVit object location and graph convolution network

GvT: A Graph-based Vision Transformer with Talking-Heads Utilizing Sparsity, Trained from Scratch on Small Datasets

A novel dual-granularity lightweight transformer for vision tasks

SAG-ViT: A Scale-Aware, High-Fidelity Patching Approach with Graph Attention for Vision Transformers

Attention-based Multi-scale ViT Fine-grained Visual Classification

Multimodal fine-grained grocery product recognition using image and OCR text

GFPE-ViT: vision transformer with geometric-fractal-based position encoding

RAMS-Trans: Recurrent Attention Multi-scale Transformer for Fine-grained Image Recognition

Multi-level information fusion Transformer with background filter for fine-grained image recognition

AA-Trans: Core Attention Aggregating Transformer with Information Entropy Selector for Fine-grained Visual Classification