Rethinking Visual Content Refinement in Low-Shot CLIP Adaptation

Jinda Lu, Shuo Wang, Yanbin Hao, Haifeng Liu, Xiang Wang, Meng Wang
2024-07-19
Abstract: Recent adaptations can boost the low-shot capability of Contrastive Vision-Language Pre-training (CLIP) by effectively facilitating knowledge transfer. However, these adaptation methods usually operate on the global view of an input image and thus have a biased perception of some local details of the image. To solve this problem, we propose Visual Content Refinement (VCR), which is applied before the adaptation calculation during the test stage. Specifically, we first decompose the test image into different scales to shift the feature extractor's attention to the details of the image. Then, we select the image view with the maximum prediction margin in each scale to filter out noisy image views, where the prediction margins are calculated from the pre-trained CLIP model. Finally, we merge the content of the selected image views based on their scales to construct a new robust representation. The merged content can thus be used directly to help the adapter focus on both global and local parts without any extra training parameters. We apply our method to 3 popular low-shot benchmark tasks with 13 datasets and achieve a significant improvement over state-of-the-art methods. For example, compared to the baseline (Tip-Adapter) on the few-shot classification task, our method achieves about 2% average improvement for both training-free and training-need settings.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the perceptual bias in existing low-shot CLIP adaptation methods. These methods usually operate on the global view of the input image, which leads to a biased perception of some local details. As a result, the model can overlook the overall appearance of the object and focus too much on small components or environmental noise, which hurts prediction accuracy. To address this, the authors propose the Visual Content Refinement (VCR) method. The main contributions of VCR are as follows:

1. **Multi-scale Decomposition**: Decompose the input image into multiple scales so that the adaptation process can alleviate the perceptual bias in CLIP by discarding irrelevant noise and retaining more local details.
2. **Content Refinement Module**: Design a refinement module that selects the most relevant multi-scale content and combines the chosen image views into an improved image representation, which further strengthens the adaptation process.
3. **Experimental Verification**: Run experiments on three low-shot image recognition tasks across 13 benchmark datasets. The reported results show a significant improvement over state-of-the-art methods.

### Specific Problem Description

Existing CLIP adaptation methods mainly suffer from two perceptual bias problems:

- **Component Bias**: The adaptation model tends to focus on small components of the object and ignore its overall appearance. For example, when processing a bicycle image, the model may pay more attention to the front fender than to the overall structure of the bicycle.
- **Environmental Bias**: The adaptation model tends to prioritize environmental noise over the object itself. For example, when processing a bird image, the model may pay more attention to the branch than to the bird.

The root cause of these problems is the lack of a comprehensive description of the image, which results in insufficient or excessive attention to specific local details. To solve them, the authors introduce a multi-scale representation that combines image views at different scales, so that the model can attend to both global and local information.

### Solution

The VCR method mainly includes the following steps:

1. **Visual Decomposition**: Decompose the input image into multiple scales. The smaller scales retain more local details, while the larger scales contain more structural content.
2. **Content Refinement**: Use the pre-trained CLIP model to calculate the prediction scores of the image views at each scale, then apply the maximum margin criterion to filter out noisy views and keep the view with the largest prediction margin in each scale. Finally, merge the visual features of the selected views according to their scales to fuse the different contents of the input image.

Through these steps, VCR can improve low-shot CLIP adaptation methods without adding any training parameters, and improve the accuracy and robustness of image recognition.
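
The decomposition, max-margin selection, and scale-based merging steps described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and it rests on several assumptions the summary does not confirm: views are random crops covering a fraction `scale` of each image side, the prediction margin is the gap between the top-1 and top-2 zero-shot class probabilities, and the merge weights are proportional to the scale. All function names (`crop_views`, `prediction_margin`, `select_view`, `refine`) and the `temperature` value are hypothetical, and the view features are assumed to come from a CLIP image encoder that is not shown here.

```python
import numpy as np


def crop_views(image, scale, n_views, rng):
    """Visual decomposition (assumed form): random crops covering `scale` of each side."""
    h, w = image.shape[:2]
    ch, cw = max(1, int(round(h * scale))), max(1, int(round(w * scale)))
    views = []
    for _ in range(n_views):
        top = rng.integers(0, h - ch + 1)
        left = rng.integers(0, w - cw + 1)
        views.append(image[top:top + ch, left:left + cw])
    return views


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def prediction_margin(probs):
    """Assumed margin: highest minus second-highest class probability per view."""
    top2 = np.sort(probs, axis=-1)[..., -2:]
    return top2[..., 1] - top2[..., 0]


def select_view(view_feats, text_feats, temperature=100.0):
    """Index of the view whose zero-shot prediction has the largest margin."""
    v = view_feats / np.linalg.norm(view_feats, axis=-1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=-1, keepdims=True)
    probs = softmax(temperature * v @ t.T)  # cosine similarity to each class prompt
    return int(np.argmax(prediction_margin(probs)))


def refine(feats_per_scale, text_feats, scales):
    """Keep one view per scale, then merge with scale-proportional weights (assumption)."""
    picked = np.stack([f[select_view(f, text_feats)] for f in feats_per_scale])
    weights = np.asarray(scales, dtype=float)
    weights = weights / weights.sum()
    return weights @ picked
```

In this sketch, `feats_per_scale` is a list with one `(n_views, d)` feature array per scale and `text_feats` is a `(n_classes, d)` array of class prompt embeddings. The merged vector returned by `refine` would then replace the single global image feature passed to an adapter such as Tip-Adapter, which is why no additional training parameters are needed.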