Hamed Damirchi, Cristian Rodríguez-Opazo, Ehsan Abbasnejad, Damien Teney, Javen Qinfeng Shi, Stephen Gould, Anton van den Hengel
Abstract: Large pre-trained models can dramatically reduce the amount of task-specific data required to solve a problem, but they often fail to capture domain-specific nuances out of the box. The Web likely contains the information necessary to excel on any specific application, but identifying the right data a priori is challenging. This paper shows how to leverage recent advances in NLP and multi-modal learning to augment a pre-trained model with search engine retrieval. We propose to retrieve useful data from the Web at test time based on test cases that the model is uncertain about. Different from existing retrieval-augmented approaches, we then update the model to address this underlying uncertainty. We demonstrate substantial improvements in zero-shot performance, e.g. a remarkable increase of 15 percentage points in accuracy on the Stanford Cars and Flowers datasets. We also present extensive experiments that explore the impact of noisy retrieval and different learning strategies.
### What problem does this paper attempt to solve?
This paper addresses the weak performance of pre-trained models on domain-specific tasks, which stems from their lack of domain-specific detail. Large pre-trained models can significantly reduce the amount of task-specific data needed to solve a problem, but out of the box they usually fail to capture the nuances of a particular domain. The Web likely contains the information needed to excel at any specific application, yet identifying the right data in advance is difficult.
To this end, the authors propose augmenting a pre-trained model with search engine retrieval. At test time, the method retrieves data from the Web for the test cases the model is uncertain about, then updates the model on the retrieved data to resolve that uncertainty. The method targets the zero-shot setting and yields substantial gains there, for example an increase of 15 percentage points in accuracy on the Stanford Cars and Flowers datasets.
### Main contributions of the paper
1. **Combining recognition with search engines**: The paper proposes enriching pre-trained visual recognition models with web search at test time, without additional labels or manual input. The method integrates easily into existing machine-learning pipelines and improves performance.
2. **Uncertainty measurement based on classification entropy**: An uncertainty measure based on classification entropy is implemented on top of CLIP, so that retrieval is triggered only where the model is uncertain, which keeps the amount of retrieved data small. Technically, image embeddings are projected onto the hypersphere, and a mixture of von Mises-Fisher distributions is used to characterize the distribution of the underlying concepts.
3. **Extensive experimental verification**: A large number of experiments explore different implementation options. On the Stanford Cars and Flowers datasets, the method improves accuracy by more than 15 percentage points. The experiments also show a strong correlation between the observed improvement and specific characteristics of the zero-shot task at hand, which makes it possible to assess in advance how much a given task is likely to benefit.
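The entropy-based uncertainty measure in contribution 2 can be illustrated with a short sketch. This is not the authors' implementation: the function names, the `temperature` value (chosen to mirror a CLIP-style logit scale of 100), and the `threshold` are assumptions made for illustration. One way to read the von Mises-Fisher framing is that a softmax over scaled cosine similarities equals the class posterior of a vMF mixture with equal priors and a shared concentration, so the entropy of that softmax is a natural uncertainty score.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(u, v):
    # Cosine similarity, i.e. the dot product after projecting onto the unit hypersphere.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def prediction_entropy(image_emb, class_embs, temperature=0.01):
    # Hypothetical helper: zero-shot class probabilities from scaled cosine
    # similarities, followed by the Shannon entropy (in nats) of that distribution.
    logits = [cosine(image_emb, c) / temperature for c in class_embs]
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def uncertain_indices(image_embs, class_embs, threshold, temperature=0.01):
    # Indices of test images whose prediction entropy exceeds an assumed threshold.
    return [
        i for i, emb in enumerate(image_embs)
        if prediction_entropy(emb, class_embs, temperature) > threshold
    ]
```

In this sketch, an image embedding that sits equally close to two class embeddings has entropy ln 2, while one aligned with a single class has entropy near zero, so only the former would trigger retrieval.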
### Method overview
1. **Identifying uncertain instances**: The prediction entropy of the pre-trained model is computed for each sample in the given unlabeled target dataset, and high-entropy samples are flagged as uncertain.
2. **Constructing queries and retrieving relevant images**: For these uncertain categories, queries are constructed and search engines (such as Google) are called to retrieve relevant images. The retrieved dataset may be noisy and contain irrelevant images.
3. **Refining and filtering irrelevant images**: The retrieved data is refined and filtered to remove irrelevant images and ensure the quality of the final dataset used for training.
4. **Training a small model**: A small model (such as a linear probe) is trained using the refined retrieved dataset to improve prediction performance.
In this way, the authors combine the knowledge of a pre-trained model with the up-to-date information available through search engines, improving the model's performance on specific tasks.
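Steps 3 and 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: it assumes each retrieved image is already embedded and is labeled with the class whose query returned it, it filters by cosine similarity to that class's embedding using a hypothetical `min_sim` cutoff, and it trains the linear probe as a plain softmax regression with stochastic gradient descent. All function names and hyperparameters are illustrative.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def filter_retrieved(retrieved, class_embs, min_sim=0.2):
    # retrieved: list of (embedding, query_class_index) pairs.
    # Keep an image only if it is close enough to the class it was retrieved for.
    return [(x, y) for x, y in retrieved if cosine(x, class_embs[y]) >= min_sim]

def class_scores(W, b, x):
    # Linear scores, one per class.
    return [sum(w * xi for w, xi in zip(W[k], x)) + b[k] for k in range(len(W))]

def train_linear_probe(data, num_classes, dim, lr=0.5, epochs=200):
    # Softmax regression on frozen embeddings, trained with per-sample SGD.
    W = [[0.0] * dim for _ in range(num_classes)]
    b = [0.0] * num_classes
    for _ in range(epochs):
        for x, y in data:
            logits = class_scores(W, b, x)
            m = max(logits)
            exps = [math.exp(z - m) for z in logits]
            total = sum(exps)
            for k in range(num_classes):
                # Gradient of the cross-entropy loss with respect to logit k.
                grad = exps[k] / total - (1.0 if k == y else 0.0)
                for j in range(dim):
                    W[k][j] -= lr * grad * x[j]
                b[k] -= lr * grad
    return W, b

def predict(W, b, x):
    # Index of the highest-scoring class.
    scores = class_scores(W, b, x)
    return max(range(len(scores)), key=scores.__getitem__)
```

A usage pattern would be to call `filter_retrieved` on the search results for the uncertain classes, pass the surviving pairs to `train_linear_probe`, and then use `predict` on the embeddings of the uncertain test images. Keeping the backbone frozen and fitting only a linear layer keeps the update cheap, which matters when it runs at test time.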
### Conclusion
This research presents a method that uses Web resources to improve the performance of pre-trained models, with particularly strong results in zero-shot learning scenarios. Beyond the accuracy gains it reports, the approach suggests directions for future research on test-time retrieval.