Abstract: Fine-grained visual categorization (FGVC) aims to discriminate among similar subcategories that belong to the same superclass. Because the distinctions among similar subcategories are subtle and local, telling them apart is highly challenging even for humans. Localizing these distinctions is therefore essential for FGVC, and it raises two pivotal problems: (1) Which regions are discriminative and representative enough to distinguish a subcategory from the others? (2) How many discriminative regions are necessary to achieve the best categorization performance? Addressing these two problems adaptively and intelligently remains difficult. Existing mainstream methods widely rely on hand-crafted priors and experimental validation to decide which and how many regions to gaze at, but this reliance severely restricts their usability and scalability. To address the two problems, this paper proposes a multi-scale and multi-granularity deep reinforcement learning approach (M2DRL), which learns multi-granularity discriminative region attention and multi-scale region-based feature representation. Its main contributions are as follows: (1) Multi-granularity discriminative localization is proposed to localize the distinctions via a two-stage deep reinforcement learning approach, which discovers discriminative regions at multiple granularities in a hierarchical manner (the "which" problem) and determines the number of discriminative regions automatically and adaptively (the "how many" problem). (2) Multi-scale representation learning localizes regions at different scales and encodes images at different scales, boosting fine-grained visual categorization performance. (3) A semantic reward function is proposed to drive M2DRL to fully capture salient and conceptual visual information by jointly considering attention and category information. It allows the deep reinforcement learning to localize the distinctions in a weakly supervised or even unsupervised manner. (4) Unsupervised discriminative localization is further explored to avoid the heavy labor cost of annotation and to greatly strengthen the usability and scalability of the M2DRL approach. Compared with state-of-the-art methods on two widely used fine-grained visual categorization datasets, our M2DRL approach achieves the best categorization accuracy.
