Abstract:Fine-grained visual categorization (FGVC) aims to discriminate similar subcategories that belong to the same superclass. Since the distinctions among similar subcategories are quite subtle and local, it is highly challenging to distinguish them from each other even for humans. So the localization of distinctions is essential for fine-grained visual categorization, and there are two pivotal problems: (1) Which regions are discriminative and representative to distinguish from other subcategories? (2) How many discriminative regions are necessary to achieve the best categorization performance? It is still difficult to address these two problems adaptively and intelligently. Artificial prior and experimental validation are widely used in existing mainstream methods to discover which and how many regions to gaze. However, their applications extremely restrict the usability and scalability of the methods. To address the above two problems, this paper proposes a multi-scale and multi-granularity deep reinforcement learning approach (M2DRL), which learns multi-granularity discriminative region attention and multi-scale region-based feature representation. Its main contributions are as follows: (1) Multi-granularity discriminative localization is proposed to localize the distinctions via a two-stage deep reinforcement learning approach, which discovers the discriminative regions with multiple granularities in a hierarchical manner (which problem), and determines the number of discriminative regions in an automatic and adaptive manner (how many problem). (2) Multi-scale representation learning helps to localize regions in different scales as well as encode images in different scales, boosting the fine-grained visual categorization performance. (3) Semantic reward function is proposed to drive M2DRL to fully capture the salient and conceptual visual information, via jointly considering attention and category information in the reward function. It allows the deep reinforcement learning to localize the distinctions in a weakly supervised manner or even an unsupervised manner. (4) Unsupervised discriminative localization is further explored to avoid the heavy labor consumption of annotating, and extremely strengthen the usability and scalability of our M2DRL approach. Compared with state-of-the-art methods on two widely-used fine-grained visual categorization datasets, our M2DRL approach achieves the best categorization accuracy.

Multi-level Dictionary Learning for Fine-Grained Images Categorization with Attention Model

Joint CRF and Locality-Consistent Dictionary Learning for Semantic Segmentation.

Dual attention guided multi-scale CNN for fine-grained image classification

Multi-branch and Multi-scale Attention Learning for Fine-Grained Visual Categorization

Attention cutting and padding learning for fine-grained image recognition

Learning Multi-attention Convolutional Neural Network for Fine-Grained Image Recognition

Multi-Attention Multi-Class Constraint for Fine-grained Image Recognition

The Application of Two-Level Attention Models in Deep Convolutional Neural Network for Fine-Grained Image Classification

Multi-level Discriminative Dictionary Learning with Application to Large Scale Image Classification.

Part Attentioned Fine-Grained Categorization Using Context Sensitive Flow

Weakly supervised fine-grained image classification via two-level attention activation model

Weakly Supervised Fine-Grained Image Recognition Based on Multi-Channel Attention and Object Localization

Fine-grained Category Discovery under Coarse-grained Supervision with Hierarchical Weighted Self-contrastive Learning

Which and How Many Regions to Gaze: Focus Discriminative Regions for Fine-Grained Visual Categorization

Look Closer to See Better: Recurrent Attention Convolutional Neural Network for Fine-Grained Image Recognition

Subtler mixed attention network on fine-grained image classification

Multiple Granularity Descriptors for Fine-Grained Categorization.

Object-Part Attention Driven Discriminative Localization for Fine-grained Image Classification.

Cross-layer Attention Network for Fine-grained Visual Categorization

Learning Category-Specific Dictionary and Shared Dictionary for Fine-Grained Image Categorization

Fully Convolutional Attention Networks for Fine-Grained Recognition