Abstract: Prompt learning has become a popular approach for adapting large vision-language models, such as CLIP, to downstream tasks. Typically, prompt learning relies on a fixed prompt token or an input-conditional token to fit a small amount of data under full supervision. While this paradigm can generalize to a certain range of unseen classes, it may struggle when the domain gap increases, such as in fine-grained classification and satellite image segmentation. To address this limitation, we propose Retrieval-enhanced Prompt learning (RePrompt), which introduces retrieval mechanisms to cache the knowledge representations from downstream tasks. We first construct a retrieval database from training examples, or from external examples when available. We then integrate this retrieval-enhanced mechanism into various stages of a simple prompt learning baseline. By referencing similar samples in the training set, the enhanced model is better able to adapt to new tasks with few samples. Our extensive experiments over 15 vision datasets, including 11 downstream tasks in the few-shot setting and 4 domain generalization benchmarks, demonstrate that RePrompt achieves considerably improved performance. Our proposed approach provides a promising solution to the challenges faced by prompt learning when the domain gap increases. The code and models will be available.
What problem does this paper attempt to address?
The paper addresses how to improve few-shot classification with vision-language models when the domain gap is large, as in fine-grained classification and satellite image recognition. It proposes Retrieval-enhanced Visual Prompt Learning (RePrompt), which introduces a retrieval mechanism to mitigate the domain gap and thereby improve performance on downstream tasks.
### Background of the Paper and Problem Definition
Contrastive Language-Image Pretraining (CLIP) models are widely used in downstream visual tasks, and few-shot learning is commonly adopted to adapt them. However, existing few-shot methods may struggle with fine-grained classification and satellite image recognition, where the domain gap is larger. A domain gap is the difference in data distribution between the source domain and the target domain, and it reduces the model's ability to generalize to the new domain.
### Proposed Method
To solve the above problems, the paper proposes the RePrompt method, which enhances visual prompt learning through the following steps:
1. **Constructing a Retrieval Database**:
- Extract features from training samples or external data to construct a retrieval database of key-value pairs. Each key-value pair consists of an image representation and a label.
- Use Stable Diffusion to generate additional training data, expanding the retrieval database without extra human effort for data collection and annotation.
2. **Retrieval-enhanced Mechanism**:
- During inference, given a query image, retrieve the most similar samples from the retrieval database.
- Use the retrieved samples to generate dynamic visual prompts, which are inserted into multiple layers of the image encoder to enhance the model's representational ability.
3. **Fusion Mechanism**:
- Generate a fusion vector by aggregating the features of the retrieved similar samples.
- Concatenate the query vector, the fusion vector, and the retrieved vectors, and pass them as input to the visual prompt learner to generate retrieval-enhanced visual prompts.
4. **Final Prediction**:
- During inference, combine the retrieval-enhanced visual prompts with conventional prompt tuning to produce the final prediction.
- Use the non-parametric k-nearest neighbor (kNN) algorithm to compute the similarity between the query instance and the samples in the database, and derive prediction probabilities from these similarities.
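The retrieval, fusion, and kNN steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the helper names (`build_database`, `retrieve`, `fuse`, `knn_probs`, `combine`), the softmax weighting with a `temperature`, and the interpolation weight `lam` are all hypothetical choices made for the example, and the toy 2-D vectors stand in for image-encoder features.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def build_database(features, labels):
    """Key-value store: (normalized image feature, class label) pairs."""
    return [(l2_normalize(f), y) for f, y in zip(features, labels)]

def retrieve(database, query, k):
    """Return the k most similar entries as (similarity, key, label) tuples."""
    q = l2_normalize(query)
    scored = [(sum(a * b for a, b in zip(q, key)), key, label)
              for key, label in database]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]

def fuse(neighbors, temperature=0.1):
    """Assumed aggregation: similarity-weighted average of the retrieved keys."""
    weights = softmax([sim / temperature for sim, _, _ in neighbors])
    dim = len(neighbors[0][1])
    return [sum(w * key[i] for w, (_, key, _) in zip(weights, neighbors))
            for i in range(dim)]

def knn_probs(neighbors, num_classes, temperature=0.1):
    """Turn neighbor similarities into class probabilities (weighted votes)."""
    weights = softmax([sim / temperature for sim, _, _ in neighbors])
    probs = [0.0] * num_classes
    for w, (_, _, label) in zip(weights, neighbors):
        probs[label] += w
    return probs

def combine(model_probs, retrieval_probs, lam=0.5):
    """Assumed combination rule: linear interpolation of the two distributions."""
    return [(1 - lam) * p + lam * r for p, r in zip(model_probs, retrieval_probs)]

# Toy data: four stored features, two classes.
database = build_database(
    features=[[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]],
    labels=[0, 0, 1, 1],
)
query = [1.0, 0.2]
neighbors = retrieve(database, query, k=2)

# Input sequence for a (not shown) visual prompt learner:
# query vector, fusion vector, then the retrieved vectors.
fused = fuse(neighbors)
prompt_input = [l2_normalize(query), fused] + [key for _, key, _ in neighbors]

retrieval_probs = knn_probs(neighbors, num_classes=2)
final_probs = combine([0.6, 0.4], retrieval_probs, lam=0.5)
```

In this toy run both retrieved neighbors belong to class 0, so the retrieval distribution pulls the assumed model prediction `[0.6, 0.4]` further toward class 0. The visual prompt learner that would consume `prompt_input` is left out, since the summary does not describe its architecture.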
### Experimental Results
The experiments show that RePrompt achieves state-of-the-art performance on a variety of visual datasets, including 11 image datasets, 3 video datasets, 1 multi-view dataset, and 4 domain generalization benchmarks. These results indicate that RePrompt performs well in few-shot classification and also generalizes to unseen domains.
### Main Contributions
1. **Proposed a retrieval-enhanced visual prompt learning method**, which improves few-shot classification performance by constructing a retrieval database and introducing a retrieval mechanism.
2. **Explored the feasibility of introducing a retrieval system into vision-language models**, dynamically selecting relevant references to improve performance on downstream tasks.
3. **Demonstrated the flexibility of this method across different tasks**, such as video understanding and multi-view recognition.
4. **Achieved state-of-the-art performance across a variety of datasets and few-shot settings**, and performed well on domain generalization benchmarks.
In conclusion, this paper mitigates the domain gap problem in few-shot classification by introducing a retrieval-enhanced mechanism, offering a new direction for improving the performance of vision-language models in practical applications.