Abstract: Recent adaptations can boost the low-shot capability of Contrastive Vision-Language Pre-training (CLIP) by effectively facilitating knowledge transfer. However, these adaptation methods usually operate on the global view of an input image and thus have a biased perception of some local details of the image. To solve this problem, we propose Visual Content Refinement (VCR), applied before the adaptation calculation during the test stage. Specifically, we first decompose the test image into different scales to shift the feature extractor's attention to the details of the image. Then, we select the image view with the maximum prediction margin in each scale to filter out noisy image views, where the prediction margins are calculated from the pre-trained CLIP model. Finally, we merge the content of the selected image views based on their scales to construct a new, robust representation. The merged content can then be used directly to help the adapter focus on both global and local parts without any extra training parameters. We apply our method to 3 popular low-shot benchmark tasks with 13 datasets and achieve a significant improvement over state-of-the-art methods. For example, compared to the baseline (Tip-Adapter) on the few-shot classification task, our method achieves about 2% average improvement in both the training-free and training-need settings.
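The abstract does not spell out how the prediction margin is computed. One common definition, assumed here purely for illustration, is the gap between the two largest class probabilities that CLIP assigns to an image view. In the notation below, $v$ is an image view, $V_s$ is the set of views at scale $s$, and $p_{(1)}(v) \ge p_{(2)}(v)$ are the largest and second-largest softmax probabilities over the class prompts. All of these symbols are introduced for this sketch and are not taken from the paper.

$$
m(v) = p_{(1)}(v) - p_{(2)}(v), \qquad v_s^{*} = \arg\max_{v \in V_s} m(v)
$$

Under this assumption, a view whose top prediction barely beats the runner-up gets a small margin and is treated as noisy, while the most confidently separated view at each scale is kept.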

What problem does this paper attempt to address?

The paper addresses perceptual bias in existing low-shot CLIP adaptation methods. These methods usually operate on the global view of the input image, which gives them a biased perception of some of the image's local details. As a result, the model can overlook the overall appearance of the object and focus too heavily on small components or environmental noise, which hurts prediction accuracy. To address this, the authors propose the Visual Content Refinement (VCR) method. Its main contributions are:

1. **Multi-scale Decomposition**: Decompose the input image into multiple scales so that the adaptation process can alleviate CLIP's perceptual bias by discarding irrelevant noise and retaining more local details.
2. **Content Refinement Module**: Design a refinement module that actively selects the most relevant multi-scale content and combines different image views to improve the image representation, which further strengthens the adaptation process.
3. **Experimental Verification**: Experiments on three low-shot image recognition tasks across 13 benchmark datasets show that the method significantly outperforms current methods.

### Specific Problem Description

Existing CLIP adaptation methods mainly suffer from two kinds of perceptual bias:

- **Component Bias**: The adaptation model tends to focus on small components of the object and ignore its overall appearance. For example, when processing a bicycle image, the model may attend to the front fender rather than the overall structure of the bicycle.
- **Environmental Bias**: The adaptation model tends to prioritize environmental noise over the object itself. For example, when processing a bird image, the model may attend to the branch rather than the bird.

The root cause of both problems is the lack of a comprehensive description of the image, which leads to too little or too much attention on specific local details. To address this, the authors introduce a multi-scale representation that combines image views at different scales, so that the model attends to global and local information at the same time.

### Solution

The VCR method consists of the following steps:

1. **Visual Decomposition**: Decompose the input image into multiple scales. The smaller scales retain more local details, while the larger scales contain more structural content.
2. **Content Refinement**: Use the pre-trained CLIP model to compute prediction scores for the image views at each scale, then apply the maximum-margin criterion to filter out noisy views and keep the view with the largest prediction margin. Finally, merge the visual features of the selected views according to their scales to fuse the different contents of the input image.

Through these steps, VCR improves low-shot CLIP adaptation methods without adding any training parameters, and improves the accuracy and robustness of image recognition.
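The decomposition, selection, and merging steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the margin is assumed to be the top-1 minus top-2 softmax probability, the views are random crops, the merge is a (by default uniform) weighted average, and every name here (`refine`, `random_crop`, `merge`, `toy_encode`, `toy_classify`) is hypothetical. In a real setup, `encode` would be CLIP's image encoder and `classify` its zero-shot text classifier.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prediction_margin(logits):
    """Assumed margin: top-1 minus top-2 probability (needs >= 2 classes)."""
    top = sorted(softmax(logits), reverse=True)
    return top[0] - top[1]


def random_crop(image, scale, rng):
    """Crop a window covering `scale` (0 < scale <= 1) of each image side."""
    h, w = len(image), len(image[0])
    ch, cw = max(1, round(h * scale)), max(1, round(w * scale))
    top = rng.randint(0, h - ch)
    left = rng.randint(0, w - cw)
    return [row[left:left + cw] for row in image[top:top + ch]]


def merge(features, weights=None):
    """Weighted average of feature vectors; uniform weights are a placeholder."""
    if weights is None:
        weights = [1.0] * len(features)
    total = sum(weights)
    dim = len(features[0])
    return [sum(w * f[i] for w, f in zip(weights, features)) / total
            for i in range(dim)]


def refine(image, encode, classify, scales=(0.25, 0.5, 0.75, 1.0),
           views_per_scale=8, seed=0):
    """Keep the max-margin view per scale, then merge the kept features."""
    rng = random.Random(seed)
    selected = []
    for scale in scales:
        views = [random_crop(image, scale, rng) for _ in range(views_per_scale)]
        feats = [encode(v) for v in views]
        # Views with a small margin are treated as noisy and dropped.
        selected.append(max(feats, key=lambda f: prediction_margin(classify(f))))
    return merge(selected)


# Toy stand-ins (hypothetical) so the sketch runs without a CLIP model.
def toy_encode(view):
    pixels = [p for row in view for p in row]
    return [sum(pixels) / len(pixels), max(pixels)]


def toy_classify(feat):
    prototypes = [[1.0, 0.0], [0.0, 1.0]]
    return [sum(a * b for a, b in zip(feat, proto)) for proto in prototypes]


image = [[(r * 8 + c) / 63 for c in range(8)] for r in range(8)]
refined = refine(image, toy_encode, toy_classify)
print(len(refined))  # 2: one merged vector with the encoder's dimension
```

The refined feature would then be passed to the adapter in place of the global image feature. Because selection and merging only reuse the frozen encoder, the sketch adds no trainable parameters, which matches the training-free claim in the summary.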

Rethinking Visual Content Refinement in Low-Shot CLIP Adaptation
