Few-Shot Recognition via Stage-Wise Retrieval-Augmented Finetuning

Tian Liu, Huixin Zhang, Shubham Parashar, Shu Kong
2024-11-24
Abstract: Few-shot recognition (FSR) aims to train a classification model with only a few labeled examples of each concept concerned by a downstream task, where data annotation cost can be prohibitively high. We develop methods to solve FSR by leveraging a pretrained Vision-Language Model (VLM). We particularly explore retrieval-augmented learning (RAL), which retrieves data from the VLM's pretraining set to learn better models for serving downstream tasks. RAL has been widely studied in zero-shot recognition but remains under-explored in FSR. Although applying RAL to FSR may seem straightforward, we observe interesting and novel challenges and opportunities. First, somewhat surprisingly, finetuning a VLM on a large amount of retrieved data underperforms state-of-the-art zero-shot methods. This is due to the imbalanced distribution of retrieved data and its domain gaps with the few-shot examples in the downstream task. Second, more surprisingly, we find that simply finetuning a VLM solely on few-shot examples significantly outperforms previous FSR methods, and finetuning on the mix of retrieved and few-shot data yields even better results. Third, to mitigate the imbalanced distribution and domain gap issues, we propose Stage-Wise retrieval-Augmented fineTuning (SWAT), which involves end-to-end finetuning on mixed data in the first stage and retraining the classifier on the few-shot data in the second stage. Extensive experiments on nine popular benchmarks demonstrate that SWAT significantly outperforms previous methods by more than 6% accuracy.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper addresses few-shot recognition (FSR): training an image classifier from only a few labeled examples per class. Specifically, the authors study how to use a pretrained Vision-Language Model (VLM), together with data retrieved from its pretraining set, to improve FSR performance. The motivation is that in practical applications, such as automated data labeling, obtaining a large amount of labeled data is very costly, so methods that can train efficient classification models from a small number of samples are needed. The authors explore Retrieval-Augmented Learning (RAL) for FSR and propose a two-stage method, Stage-Wise retrieval-Augmented fineTuning (SWAT), to address two problems with retrieved data: its imbalanced class distribution and its domain gap with the few-shot examples. In the first stage the model is finetuned end-to-end on the mix of retrieved and few-shot data; in the second stage the classifier is retrained on the few-shot data alone. According to the abstract, SWAT outperforms previous methods by more than 6% accuracy across nine benchmarks.
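The two-stage schedule can be illustrated with a minimal sketch. This is not the authors' implementation: it swaps the VLM for a toy two-layer linear model in plain Python so that the control flow is easy to follow. The names `swat_toy`, `sgd_epoch`, and `predict`, the learning rate, the epoch counts, and the choice to continue stage 2 from the stage-1 classifier weights are all assumptions made for illustration.

```python
import math
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in M]


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def sgd_epoch(W, V, data, lr, train_backbone):
    """One in-place SGD pass with softmax cross-entropy loss.

    W is the toy "backbone" (hidden x dim), V the classifier (classes x hidden).
    When train_backbone is False, only V is updated.
    """
    for x, y in data:
        f = matvec(W, x)           # features from the backbone
        p = softmax(matvec(V, f))  # class probabilities
        # Gradient of the loss with respect to the logits.
        g = [pc - (1.0 if c == y else 0.0) for c, pc in enumerate(p)]
        # Gradient with respect to the features, computed before V changes.
        df = [sum(V[c][j] * g[c] for c in range(len(V))) for j in range(len(f))]
        for c in range(len(V)):
            for j in range(len(f)):
                V[c][j] -= lr * g[c] * f[j]
        if train_backbone:
            for j in range(len(W)):
                for i in range(len(x)):
                    W[j][i] -= lr * df[j] * x[i]


def swat_toy(retrieved, few_shot, dim, hidden, n_classes,
             epochs=(5, 5), lr=0.05, seed=0):
    """Hypothetical two-stage schedule on a toy model; data are (x, label) pairs."""
    rng = random.Random(seed)
    W = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(hidden)]
    V = [[rng.gauss(0.0, 0.1) for _ in range(hidden)] for _ in range(n_classes)]

    # Stage 1: end-to-end finetuning on the mix of retrieved and few-shot data.
    mixed = list(retrieved) + list(few_shot)
    for _ in range(epochs[0]):
        rng.shuffle(mixed)
        sgd_epoch(W, V, mixed, lr, train_backbone=True)

    # Stage 2: backbone frozen, classifier retrained on few-shot data only.
    # Assumption: the classifier continues from its stage-1 weights.
    for _ in range(epochs[1]):
        sgd_epoch(W, V, few_shot, lr, train_backbone=False)
    return W, V


def predict(W, V, x):
    """Return the index of the highest-scoring class for input x."""
    logits = matvec(V, matvec(W, x))
    return max(range(len(logits)), key=logits.__getitem__)
```

In this sketch, `retrieved` may be large and class-imbalanced while `few_shot` is small and balanced. Stage 1 lets the backbone benefit from the larger pool, and stage 2 refits only the classifier on the balanced, in-domain few-shot examples, which mirrors the stated motivation of mitigating imbalance and domain gap.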