Abstract: Fine-grained recognition involves the classification of images from subordinate macro-categories, and it is challenging due to small inter-class differences. To overcome this, most methods perform discriminative feature selection enabled by a feature extraction backbone followed by a high-level feature refinement step. Recently, many studies have shown the potential behind vision transformers as a backbone for fine-grained recognition, but their usage of its attention mechanism to select discriminative tokens can be computationally expensive. In this work, we propose a novel and computationally inexpensive metric to identify discriminative regions in an image. We compare the similarity between the global representation of an image given by the CLS token, a learnable token used by transformers for classification, and the local representation of individual patches. We select the regions with the highest similarity to obtain crops, which are forwarded through the same transformer encoder. Finally, high-level features of the original and cropped representations are further refined together in order to make more robust predictions. Through extensive experimental evaluation we demonstrate the effectiveness of our proposed method, obtaining favorable results in terms of accuracy across a variety of datasets. Furthermore, our method achieves these results at a much lower computational cost compared to the alternatives. Code and checkpoints are available at: https://github.com/arkel23/GLSim

What problem does this paper attempt to address?

This paper addresses the challenges of fine-grained image recognition (FGIR), in particular the small differences between classes. Specifically, the authors propose a new metric, Global-Local Similarity (GLS), to efficiently select discriminative image regions.

### Research Background and Problems

Fine-grained image recognition involves classifying subcategories that belong to a larger super-category, such as distinguishing different bird species or anime characters. Small inter-class differences and large intra-class variations make the task very challenging. Existing methods usually rely on a feature extraction backbone to select discriminative features, which are then improved in a high-level feature refinement step.

Recent studies show that Vision Transformers (ViTs) have potential as backbones for fine-grained image recognition, but using their attention mechanism to select discriminative tokens is computationally expensive. Especially for high-resolution images, the computational complexity can reach \(O(N^3)\) in the number of tokens \(N\), which limits the practicality of these methods.

### Proposed Method

To solve these problems, the authors propose the GLS method. GLS identifies discriminative regions by comparing the global representation (given by the CLS token) with the local representation (given by each patch token). The specific steps are as follows:

1. **Calculate Similarity**: Use cosine similarity to compare the global representation with each local token:
\[ s_i=\text{sim}(f_0, f_i)=\cos(f_0, f_i)=\frac{f_0\cdot f_i}{\|f_0\|\|f_i\|} \]
where \(f_0\) is the CLS token and \(f_i\) is the \(i\)-th local token.
2. **Select Discriminative Regions**: Rank the tokens by similarity score, select the image regions corresponding to the top-\(O\) tokens with the highest similarity, and crop these regions.
3. **Feature Fusion**: After the original image and the cropped image pass through the same Transformer encoder, the Aggregator module combines the high-level features of both, and the classification head outputs the final prediction.

### Experimental Results

Through experiments on multiple datasets, the authors demonstrate the effectiveness and efficiency of the GLS method. Compared with other methods, GLS both improves classification accuracy and significantly reduces computational cost. Specifically:

- On the NABirds dataset, GLS improves accuracy by 3.1% over the ViT baseline and by 0.7% over the second-best method, Dual-TR.
- On the iNat17 dataset, GLS improves accuracy by 1.5% over DeiT-NET, where the relative improvement is more pronounced.
- In the comprehensive evaluation on 10 different datasets, GLS achieves the highest accuracy on 8 datasets and reduces the relative classification error by an average of 10.15%.

### Conclusion

The GLS method provides an efficient and effective tool for discriminative region selection in Vision Transformers, thereby improving the performance of fine-grained image recognition. In addition, the computational complexity of GLS is only linear, \(O(N)\), much lower than the \(O(N^3)\) of attention-based selection, which makes the method more feasible in practical applications.
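The similarity and selection steps of the proposed method can be illustrated with a small sketch. This is not the authors' implementation: it uses plain Python for portability, and the function names (`cosine_similarity`, `select_crop_box`), the row-major patch layout, and the choice to crop the bounding box that encloses the top-\(O\) patches are all assumptions made for illustration. Only the cosine similarity formula and the top-\(O\) ranking come from the summary above.

```python
import math


def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (||u|| ||v||); assumes both vectors are nonzero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_crop_box(tokens, grid_w, patch_size, top_o):
    """Score patches against the CLS token and return a crop box.

    tokens[0] is the CLS token f_0; tokens[1:] are patch tokens f_1..f_N,
    assumed to be laid out in row-major order on a grid `grid_w` patches wide.
    Returns (scores, (left, top, right, bottom)) with the box in pixels.
    """
    cls_token, patch_tokens = tokens[0], tokens[1:]

    # Step 1: one similarity score per patch, so the cost is linear in N.
    scores = [cosine_similarity(cls_token, f_i) for f_i in patch_tokens]

    # Step 2: indices of the top-O patches with the highest similarity.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top_indices = ranked[:top_o]

    # Assumption: the crop is the bounding box enclosing the selected patches.
    rows = [i // grid_w for i in top_indices]
    cols = [i % grid_w for i in top_indices]
    box = (
        min(cols) * patch_size,
        min(rows) * patch_size,
        (max(cols) + 1) * patch_size,
        (max(rows) + 1) * patch_size,
    )
    return scores, box


# Toy example: a 2x2 grid of 16-pixel patches with 2-dimensional tokens.
toy_tokens = [
    [1.0, 0.0],   # f_0: CLS token
    [0.0, 1.0],   # patch 0 (row 0, col 0)
    [1.0, 0.0],   # patch 1 (row 0, col 1)
    [-1.0, 0.0],  # patch 2 (row 1, col 0)
    [1.0, 1.0],   # patch 3 (row 1, col 1)
]
toy_scores, toy_box = select_crop_box(toy_tokens, grid_w=2, patch_size=16, top_o=2)
print(toy_scores, toy_box)
```

In the toy example, patches 1 and 3 are most similar to the CLS token, so the box covers the right-hand column of the grid. A practical version would operate on batched encoder outputs with tensor operations, and the cropped image would then be resized and passed through the same encoder as described in the feature fusion step.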

Global-Local Similarity for Efficient Fine-Grained Image Recognition with Vision Transformers

SLViT: Scale-Wise Language-Guided Vision Transformer for Referring Image Segmentation

Local-to-Global Self-Attention in Vision Transformers

SIM-Trans: Structure Information Modeling Transformer for Fine-grained Visual Categorization

LF-ViT: Reducing Spatial Redundancy in Vision Transformer for Efficient Image Recognition

LGFCTR: Local and Global Feature Convolutional Transformer for Image Matching

Other Tokens Matter: Exploring Global and Local Features of Vision Transformers for Object Re-Identification

TransFG: A Transformer Architecture for Fine-Grained Recognition

MFF-Trans: Multi-level Feature Fusion Transformer for Fine-Grained Visual Classification

Unifying Global-Local Representations in Salient Object Detection with Transformer

FET-FGVC: Feature-enhanced transformer for fine-grained visual classification

An Integrated Transformer with Collaborative Tokens Mining for Fine-Grained Recognition

GLaLT: Global-Local Attention-Augmented Light Transformer for Scene Text Recognition

MAFormer: A transformer network with multi-scale attention fusion for visual recognition

GLiT: Neural Architecture Search for Global and Local Image Transformer

Part-Guided Relational Transformers for Fine-Grained Visual Recognition

Global and Local Feature Interaction with Vision Transformer for Few-shot Image Classification

A free lunch from ViT: Adaptive Attention Multi-scale Fusion Transformer for Fine-grained Visual Recognition

MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens

SG-Former: Self-guided Transformer with Evolving Token Reallocation