Abstract: Fine-grained image recognition is challenging because discriminative clues are usually fragmented, whether within a single image or across multiple images. Despite significant improvements, most existing methods still focus on the most discriminative parts of a single image, ignoring informative details in other regions and failing to consider clues from other associated images. In this paper, we analyze the difficulties of fine-grained image recognition from a new perspective and propose a transformer architecture with a peak suppression module and a knowledge guidance module, which respects the diversification of discriminative features in a single image and the aggregation of discriminative clues among multiple images. Specifically, the peak suppression module first uses a linear projection to convert the input image into sequential tokens. It then blocks tokens based on the attention response generated by the transformer encoder. This module penalizes attention to the most discriminative parts during feature learning, thereby enhancing the exploitation of information in neglected regions. The knowledge guidance module compares the image-based representation generated by the peak suppression module with a learnable knowledge embedding set to obtain knowledge response coefficients. It then formalizes knowledge learning as a classification problem, using the response coefficients as the classification scores. Knowledge embeddings and image-based representations are updated simultaneously during training, so that the knowledge embeddings capture a large number of discriminative clues from different images of the same category. Finally, we incorporate the acquired knowledge embeddings into the image-based representations to form comprehensive representations, leading to significantly higher recognition performance. Extensive evaluations on six popular datasets demonstrate the advantage of the proposed method in performance. The source code and models will be available online after the acceptance of the paper.
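The two modules described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify how many tokens are blocked, how the representation is compared with the knowledge embeddings, or how the two are fused. The sketch assumes that the top-`k` tokens by attention are zeroed out, that the response coefficients are dot products, and that the comprehensive representation adds a softmax-weighted sum of knowledge embeddings to the image representation. All function names (`suppress_peak`, `knowledge_response`, `knowledge_loss`, `comprehensive_repr`) are hypothetical.

```python
import numpy as np


def suppress_peak(tokens, attention, k=1):
    """Zero out the k tokens with the highest attention response.

    Assumption: blocking means masking the token to zero; the abstract
    only says tokens are blocked based on the attention response.
    """
    tokens = np.asarray(tokens, dtype=float)
    attention = np.asarray(attention, dtype=float)
    peak_idx = np.argsort(attention)[-k:]  # indices of the k largest responses
    keep = np.ones(len(attention), dtype=bool)
    keep[peak_idx] = False
    return tokens * keep[:, None], keep


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def knowledge_response(image_repr, knowledge):
    """Response coefficients: one score per knowledge embedding.

    Assumption: dot-product similarity, with one embedding per category.
    """
    return np.asarray(knowledge, dtype=float) @ np.asarray(image_repr, dtype=float)


def knowledge_loss(image_repr, knowledge, label):
    """Cross-entropy that treats the response coefficients as class scores."""
    probs = softmax(knowledge_response(image_repr, knowledge))
    return -np.log(probs[label])


def comprehensive_repr(image_repr, knowledge):
    """Fuse knowledge into the image representation.

    Assumption: add a softmax-weighted sum of the knowledge embeddings.
    """
    weights = softmax(knowledge_response(image_repr, knowledge))
    return np.asarray(image_repr, dtype=float) + weights @ np.asarray(knowledge, dtype=float)


if __name__ == "__main__":
    tokens = np.arange(8, dtype=float).reshape(4, 2)  # 4 tokens, 2 dims
    attention = np.array([0.1, 0.6, 0.2, 0.1])        # token 1 is the peak
    masked, keep = suppress_peak(tokens, attention)
    print(keep)    # token 1 is blocked
    print(masked)

    knowledge = np.eye(3)                  # 3 categories, 3 dims
    image_repr = np.array([2.0, 0.0, 0.0])
    print(knowledge_loss(image_repr, knowledge, label=0))
    print(comprehensive_repr(image_repr, knowledge))
```

In a real model the loss would be backpropagated through both `image_repr` and `knowledge`, so that the embeddings and the image-based representations are updated together, as the abstract describes. An automatic differentiation framework would replace the plain NumPy arrays used here.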

A fine‐grained image classification method based on information interaction

A Novel Transformer Network with a CNN-Enhanced Cross-Attention Mechanism for Hyperspectral Image Classification

Hybrid Granularities Transformer for Fine-Grained Image Recognition

Dual-Branch Feature Fusion Network Based Cross-Modal Enhanced CNN and Transformer for Hyperspectral and LiDAR Classification

Transformer with peak suppression and knowledge guidance for fine-grained image recognition

CNN and Transformer interaction network for hyperspectral image classification

Multi-Modal Image Fusion Via Deep Laplacian Pyramid Hybrid Network

HDCTfusion: Hybrid Dual-Branch Network Based on CNN and Transformer for Infrared and Visible Image Fusion

FET-FGVC: Feature-enhanced transformer for fine-grained visual classification

Dual Transformer with Multi-Grained Assembly for Fine-Grained Visual Classification

Double-branch feature fusion transformer for hyperspectral image classification

Joint Classification of Hyperspectral Images and LiDAR Data Based on Dual-Branch Transformer

Dual-Dependency Attention Transformer for Fine-Grained Visual Classification

Dual-Branch Adaptive Convolutional Transformer for Hyperspectral Image Classification

AA-Trans: Core Attention Aggregating Transformer with Information Entropy Selector for Fine-grained Visual Classification

Hyperspectral Image Classification Based on Interactive Transformer and CNN With Multilevel Feature Fusion Network

Improving Fine-grained Image Classification with Multimodal Information

A Dual-Branch Multiscale Transformer Network for Hyperspectral Image Classification

Multi-directional guidance network for fine-grained visual classification

A multimodal hyper-fusion transformer for remote sensing image classification

MFF-Trans: Multi-level Feature Fusion Transformer for Fine-Grained Visual Classification