Abstract: Few-shot learning for fine-grained image classification has recently gained attention in computer vision. Among few-shot learning approaches, metric-based methods are state-of-the-art on many tasks owing to their simplicity and effectiveness. Most metric-based methods assume a single similarity measure and thus obtain a single feature space. However, if samples can simultaneously be well classified via two distinct similarity measures, the samples within a class can distribute more compactly in a smaller feature space, producing more discriminative feature maps. Motivated by this, we propose a so-called Bi-Similarity Network (BSNet) that consists of a single embedding module and a bi-similarity module with two similarity measures. After the support images and the query images pass through the convolution-based embedding module, the bi-similarity module learns feature maps according to two similarity measures of diverse characteristics. In this way, the model is enabled to learn more discriminative and less similarity-biased features from few shots of fine-grained images, such that the model's generalization ability can be significantly improved. Through extensive experiments in which we slightly modify established metric/similarity-based networks, we show that the proposed approach produces a substantial improvement on several fine-grained image benchmark datasets. Code is available at: https://github.com/PRIS-CV/BSNet.
What problem does this paper attempt to address?
This paper asks how to improve model generalization and feature discriminability in few-shot fine-grained image classification. It proposes a Bi-Similarity Network (BSNet) that trains one embedding under two different similarity measures at once (e.g., Euclidean distance and cosine distance). Because each class must be well separated under both measures, samples of the same category are mapped more compactly into a smaller region of the feature space, which yields more discriminative feature representations.
### Background and Motivation
In few-shot learning, and especially in fine-grained image classification, existing metric-based methods usually assume a single similarity measure. The learned features are then biased toward that one measure, which can limit generalization when only a few samples per class are available. The paper argues that if two different similarity measures are used simultaneously, samples within the same category are distributed more compactly in the feature space, producing more discriminative feature maps.
### Method Overview
The proposed BSNet consists of two parts:
1. **Embedding Module**: Uses convolutional neural networks to generate feature representations of support images and query images.
2. **Bi-Similarity Module**: Contains two similarity measurement branches that respectively calculate the similarity scores between the query image and each category.
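The two-branch scoring described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the Euclidean and cosine measures mentioned as examples earlier, it represents each category by the mean of its support embeddings (an assumption borrowed from prototype-based methods), and the function names `class_prototypes`, `euclidean_scores`, and `cosine_scores` are hypothetical. The embeddings are taken as already computed by some embedding module.

```python
import numpy as np

def class_prototypes(support_feats, support_labels, n_way):
    """Mean support embedding per class (assumed class representation)."""
    return np.stack(
        [support_feats[support_labels == c].mean(axis=0) for c in range(n_way)]
    )

def euclidean_scores(query_feats, prototypes):
    """Branch 1: negative squared Euclidean distance (higher = more similar)."""
    diff = query_feats[:, None, :] - prototypes[None, :, :]
    return -np.sum(diff ** 2, axis=-1)

def cosine_scores(query_feats, prototypes, eps=1e-8):
    """Branch 2: cosine similarity between each query and each prototype."""
    q = query_feats / (np.linalg.norm(query_feats, axis=1, keepdims=True) + eps)
    p = prototypes / (np.linalg.norm(prototypes, axis=1, keepdims=True) + eps)
    return q @ p.T
```

Both functions return an array of shape `(n_query, n_way)`, so the two branches produce directly comparable score tables for the same queries and categories.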
### Training Process
During meta-training, for each task, every query image receives two sets of similarity scores, one from each similarity branch, and each set yields its own predicted label. The training loss is the average of the two branch losses, and it is backpropagated to update the network parameters.
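As a rough sketch of the averaged objective, the snippet below computes one loss per branch and takes their mean. The per-branch loss is assumed here to be softmax cross-entropy over the similarity scores; the summary does not specify which loss each branch uses, and the names `cross_entropy` and `bi_similarity_loss` are hypothetical.

```python
import numpy as np

def cross_entropy(scores, labels):
    """Mean softmax cross-entropy; scores has shape (n_query, n_way)."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def bi_similarity_loss(scores_a, scores_b, labels):
    """Average of the two branch losses (per-branch loss is an assumption)."""
    return 0.5 * (cross_entropy(scores_a, labels) + cross_entropy(scores_b, labels))
```

In an actual training loop this scalar would be computed in an automatic-differentiation framework so that gradients from both branches flow into the shared embedding module.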
### Validation and Testing Process
During validation and testing, the two branches' similarity scores are averaged for each category, the query image is assigned to the category with the highest average score, and the prediction is output as the corresponding one-hot vector.
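The inference rule can be illustrated with a short sketch. It assumes the two branches output scores on comparable scales, so that a plain average is meaningful; the function name `predict_one_hot` is hypothetical.

```python
import numpy as np

def predict_one_hot(scores_a, scores_b):
    """Average the two branches' scores, pick the best class, one-hot encode it."""
    avg = 0.5 * (scores_a + scores_b)          # shape (n_query, n_way)
    pred = avg.argmax(axis=1)                  # index of the highest average score
    return np.eye(avg.shape[1], dtype=int)[pred]

# Two queries, two categories; the branches disagree on the first query.
scores_a = np.array([[0.9, 0.1], [0.4, 0.6]])
scores_b = np.array([[0.2, 0.8], [0.1, 0.9]])
one_hot = predict_one_hot(scores_a, scores_b)
```

For the first query the averaged scores are 0.55 and 0.45, so the first category wins even though the second branch alone would have chosen the other one.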
### Experimental Results
The paper conducts experiments on multiple fine-grained image classification benchmark datasets, including FGVC-Aircraft, Stanford-Cars, Stanford-Dogs, and CUB-200-2011. The experimental results show that BSNet significantly improves the performance of few-shot classification on these datasets.
### Main Contributions
1. Proposes a Bi-Similarity Network (BSNet) that combines two similarity measures, significantly improving the performance of four state-of-the-art few-shot classification methods on four fine-grained image datasets.
2. Demonstrates that the model complexity of BSNet is lower than the average complexity of two single-similarity networks, despite BSNet containing more model parameters.
3. Visualizes that BSNet can learn the discriminative regions of the input images.
In summary, the paper improves few-shot fine-grained image classification by training a single embedding under two similarity measures instead of one, an idea that can be added to existing metric-based networks with only slight modification.