Abstract: Fine-grained image classification is a branch of image classification. Recently, the vision transformer has made excellent progress in image recognition, and its self-attention mechanism extracts highly effective image features. However, feeding fixed-size image blocks into the network introduces additional noise, which is detrimental to extracting discriminative features from fine-grained images. The vision transformer is also a large model, which makes it difficult to use in practice. Moreover, many current fine-grained image classification methods focus on mining discriminative features while ignoring the connections within the image. To address these problems, we propose a novel method based on the lightweight TinyVit backbone network. Our approach uses the self-attention weights of TinyVit as a guide to construct an effective object location (OL) module that crops and enlarges the object area, allowing the network to concentrate on the local object. Additionally, we employ a graph convolutional network (GCN) to build a spatial relationship feature learning (SRFL) module that captures spatial context information between image blocks in TinyVit with the help of the transformer's self-attention weights. OL and SRFL jointly guide the classification task. Experimental results show that the proposed method achieves competitive performance, with the second-highest classification accuracy on both the CUB-200-2011 and NABirds datasets. On the Stanford Dogs dataset, our approach outperforms many popular methods. Our code is available at https://github.com/hhhj1999/SRFL_OL .
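The two ideas named in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation (see the linked repository for that). The function names, the mean-value threshold used to pick salient patches, and the row-normalised adjacency with self-loops are all assumptions made for illustration. `attention_bbox` only computes the crop box; the enlarging step (resizing the crop back to the input resolution) is omitted.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attention_bbox(attn: Matrix, patch: int) -> Tuple[int, int, int, int]:
    """Return a pixel box (left, top, right, bottom) covering every patch
    whose attention weight exceeds the map's mean (an assumed threshold)."""
    flat = [v for row in attn for v in row]
    thresh = sum(flat) / len(flat)
    rows = [r for r, row in enumerate(attn) if any(v > thresh for v in row)]
    cols = [c for row in attn for c, v in enumerate(row) if v > thresh]
    if not rows:  # uniform map: fall back to the full image
        return 0, 0, len(attn[0]) * patch, len(attn) * patch
    return (min(cols) * patch, min(rows) * patch,
            (max(cols) + 1) * patch, (max(rows) + 1) * patch)


def graph_conv(adj: Matrix, feats: Matrix, weight: Matrix) -> Matrix:
    """One propagation step ReLU(D^-1 (A + I) X W) over patch nodes.

    adj holds non-negative edge weights between patches (for example,
    attention weights); the normalisation here is an assumed choice.
    """
    n = len(adj)
    # Add self-loops, then row-normalise so each node averages its neighbours.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    a_hat = [[v / sum(row) for v in row] for row in a_hat]
    out = matmul(matmul(a_hat, feats), weight)
    return [[max(0.0, v) for v in row] for row in out]
```

In this sketch the same attention weights serve two roles: they select the region to crop, and they can act as edge weights of the patch graph, which mirrors how the abstract describes OL and SRFL sharing the transformer's self-attention.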

Attention-based Multi-scale ViT Fine-grained Visual Classification

Fine-Grained Visual Categorization With Fine-Tuned Segmentation

Hybrid ViT-CNN Network for Fine-Grained Image Classification

Fine-Grained Image Classification Based on Cross-Attention Network

ViT-FOD: A Vision Transformer based Fine-grained Object Discriminator

A free lunch from ViT: Adaptive Attention Multi-scale Fusion Transformer for Fine-grained Visual Recognition

AA-Trans: Core Attention Aggregating Transformer with Information Entropy Selector for Fine-grained Visual Classification

TransFG: A Transformer Architecture for Fine-Grained Recognition

Dual-Dependency Attention Transformer for Fine-Grained Visual Classification

FET-FGVC: Feature-enhanced transformer for fine-grained visual classification

CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

Dual Transformer with Multi-Grained Assembly for Fine-Grained Visual Classification

Multi-level information fusion Transformer with background filter for fine-grained image recognition

Fine-grained image classification based on TinyVit object location and graph convolution network

Multi-directional guidance network for fine-grained visual classification

Data Augmentation Vision Transformer for Fine-grained Image Classification

FasterViT: Fast Vision Transformers with Hierarchical Attention

Multi-branch and Multi-scale Attention Learning for Fine-Grained Visual Categorization

MFF-Trans: Multi-level Feature Fusion Transformer for Fine-Grained Visual Classification