Variational Feature Imitation Conditioned on Visual Descriptions for Few-shot Fine-grained Recognition

Xin Lu, Yixuan Pan, Yichao Cao, Xin Zhou, Xiaobo Lu
DOI: https://doi.org/10.1109/tcsvt.2024.3495533
IF: 5.859
2024-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: In few-shot fine-grained recognition (FS-FGR) tasks, the main challenge is to distinguish novel categories with high intra-class variations and low inter-class differences given scarce training data. Existing studies explore discriminative features through a compact network to avoid overfitting, but they achieve only marginal performance gains owing to the network's limited representation capability. Motivated by the significant progress of vision foundation models, we introduce one to describe visual attributes and boost the performance of the compact feature extractor. This paper proposes a few-shot fine-grained recognition method with Variational Feature Imitation Conditioned on Visual Descriptions, VFI-CVD for short. It simultaneously exploits the pre-trained knowledge of a vision foundation model and the expert knowledge mined by a feature extractor. Specifically, the intra-class variations shared across object categories are encoded into a common distribution, so that features can be augmented by sampling latent variables. To enhance the learning of intra-class variations, a condition exchange strategy (CES) is put forward to exchange knowledge between samples through feature cross-imitation. In the inference stage, the learned knowledge is further integrated through the joint prediction of visual descriptions and cross-imitated features. Comprehensive experimental results on four fine-grained benchmark datasets show that the proposed VFI-CVD achieves state-of-the-art performance, e.g., 90.37% accuracy under the 5-way 1-shot setting on CUB-200-2011. It surpasses existing methods by a large margin, especially in the challenging 30-way recognition tasks and in cross-domain evaluation. The source code is publicly available: https://github.com/Lx-zjwf/VFI-CVD.
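The abstract names three mechanisms: sampling latent variables from a shared distribution to augment features, exchanging conditions between samples for cross-imitation, and fusing two predictions at inference. The plain-Python sketch below illustrates those generic ideas under loudly stated assumptions. It is not the authors' implementation (see the linked repository for that). The visual descriptions are assumed to be precomputed condition vectors, and every name here (`encode`, `decode`, `augment`, `exchange_conditions`, `joint_predict`), the additive toy encoder and decoder, the zero-mean Gaussian prior, and the fixed-weight probability fusion are hypothetical stand-ins chosen for clarity.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Standard VAE reparameterization: z = mu + exp(0.5 * log_var) * eps, eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def encode(feature, condition, log_var=-4.0):
    """Hypothetical toy encoder q(z | x, c): the latent mean is the residual between a
    sample's feature and its class condition (an assumed description embedding)."""
    mu = [f - c for f, c in zip(feature, condition)]
    return mu, [log_var] * len(mu)


def decode(condition, z):
    """Hypothetical toy decoder p(x | z, c): the condition plus the sampled variation."""
    return [c + v for c, v in zip(condition, z)]


def augment(condition, n, rng, prior_std=0.1):
    """Generate n synthetic features for one class by drawing z from a shared
    zero-mean Gaussian prior (assumed) and decoding with that class's condition."""
    dim = len(condition)
    return [decode(condition, [rng.gauss(0.0, prior_std) for _ in range(dim)]) for _ in range(n)]


def exchange_conditions(feat_a, cond_a, feat_b, cond_b, rng):
    """Cross-imitation with swapped conditions (our assumption of the idea): the
    variation encoded from sample a is decoded with b's condition, and vice versa."""
    z_a = reparameterize(*encode(feat_a, cond_a), rng)
    z_b = reparameterize(*encode(feat_b, cond_b), rng)
    return decode(cond_b, z_a), decode(cond_a, z_b)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_predict(desc_scores, feat_scores, weight=0.5):
    """Fuse class probabilities from description-based and feature-based scores with
    a fixed convex weight (assumed; the paper's fusion rule may differ)."""
    p_desc, p_feat = softmax(desc_scores), softmax(feat_scores)
    return [weight * a + (1.0 - weight) * b for a, b in zip(p_desc, p_feat)]


if __name__ == "__main__":
    rng = random.Random(0)
    cond_warbler, cond_sparrow = [1.0, 0.0], [0.0, 1.0]  # toy 2-D conditions
    fake = augment(cond_warbler, n=3, rng=rng)
    print(len(fake), [round(x, 2) for x in fake[0]])
    x_ab, x_ba = exchange_conditions([1.2, 0.1], cond_warbler, [0.1, 0.9], cond_sparrow, rng)
    print([round(v, 2) for v in x_ab], [round(v, 2) for v in x_ba])
    probs = joint_predict([2.0, 0.5], [1.5, 1.0])
    print([round(p, 3) for p in probs], probs.index(max(probs)))
```

The toy keeps the decoder additive so that the role of the condition is easy to see: swapping `cond_a` for `cond_b` moves a sample's learned variation onto another class, which is the intuition behind cross-imitation. A real model would replace `encode` and `decode` with learned networks.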