Fine-Tuning CLIP's Last Visual Projector: A Few-Shot Cornucopia

Mohammad Fahes, Tuan-Hung Vu, Andrei Bursuc, Patrick Pérez, Raoul de Charette

2024-10-08

Abstract: We consider the problem of adapting a contrastively pretrained vision-language model like CLIP (Radford et al., 2021) for few-shot classification. The existing literature addresses this problem by learning a linear classifier of the frozen visual features, optimizing word embeddings, or learning external feature adapters. This paper introduces an alternative way to adapt CLIP without adding 'external' parameters to optimize. We find that simply fine-tuning the last projection matrix of the vision encoder leads to strong performance compared to the existing baselines. Furthermore, we show that regularizing training with the distance between the fine-tuned and pretrained matrices adds reliability for adapting CLIP through this layer. Perhaps surprisingly, this approach, coined ProLIP, yields performance on par with or better than the state of the art on 11 few-shot classification benchmarks, few-shot domain generalization, cross-dataset transfer and test-time adaptation. Code will be made available at https://github.com/astra-vision/ProLIP.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses how to adapt the CLIP model so that it classifies well when only a few labeled samples per class are available. It proposes a simple method, ProLIP (Projection Layer Fine-tuning), which fine-tunes only the projection matrix of the last layer of the visual encoder. The paper points out that existing approaches (linear classifiers on frozen features, optimized word embeddings, and external feature adapters) have limitations such as slow training and the need to design additional parameters. ProLIP addresses these issues in the following ways:

1. **Simplified model parameters**: Only the projection matrix of the last layer is adjusted; no additional parameters are introduced.
2. **Fast training**: Since only a single matrix is fine-tuned, training is fast.
3. **Utilizing text embeddings**: Text embeddings are used as classification weights during fine-tuning, consistent with the pre-training mechanism of CLIP.
4. **Regularization strategy**: A regularization term constrains the distance between the fine-tuned matrix and the pre-trained matrix to prevent overfitting.

Experimental results show that ProLIP performs well on few-shot classification across multiple datasets and is competitive in cross-dataset generalization and test-time adaptation. The paper also studies the setting where no validation set is available and shows how to select appropriate hyperparameters in few-shot scenarios.
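The four points above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes the pre-projection visual features `X` have been cached, that the class text embeddings `T` are L2-normalized and frozen, that the regularizer is the squared Frobenius distance weighted by a hypothetical `lam`, and that plain gradient descent is an acceptable stand-in for whatever optimizer the paper uses. The function names and default values are illustrative only.

```python
import numpy as np


def prolip_loss_and_grad(W, W0, X, y, T, lam=1.0, scale=100.0):
    """Cross-entropy on cosine logits plus lam * ||W - W0||_F^2 (assumed form).

    W, W0 : (D, d) current and pretrained projection matrices
    X     : (N, D) cached pre-projection visual features
    y     : (N,)   integer class labels
    T     : (K, d) L2-normalized text embeddings used as classifier weights
    """
    n = X.shape[0]
    z = X @ W                                      # projected visual features
    norm = np.linalg.norm(z, axis=1, keepdims=True)
    v = z / norm                                   # unit-norm image embeddings
    logits = scale * v @ T.T                       # scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)              # softmax probabilities

    ce = -np.log(p[np.arange(n), y]).mean()
    reg = lam * np.sum((W - W0) ** 2)

    # Backward pass, derived by hand for this small model.
    g_logits = p.copy()
    g_logits[np.arange(n), y] -= 1.0
    g_logits /= n
    g_v = scale * g_logits @ T
    # Gradient through the row-wise L2 normalization v = z / ||z||.
    g_z = (g_v - v * np.sum(g_v * v, axis=1, keepdims=True)) / norm
    grad = X.T @ g_z + 2.0 * lam * (W - W0)
    return ce + reg, grad


def finetune_projection(W0, X, y, T, lam=1.0, lr=1e-3, steps=100, scale=100.0):
    """Gradient descent on the projection matrix only; everything else is frozen."""
    W = W0.copy()
    for _ in range(steps):
        _, grad = prolip_loss_and_grad(W, W0, X, y, T, lam=lam, scale=scale)
        W -= lr * grad
    return W
```

Because the only trainable tensor is `W`, each step costs a couple of matrix products over cached features, which is where the fast-training claim comes from. Setting `lam` to zero recovers unregularized fine-tuning of the same layer.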

Enhancing Few-Shot CLIP With Semantic-Aware Fine-Tuning

Proto-CLIP: Vision-Language Prototypical Network for Few-Shot Learning

Not All Features Matter: Enhancing Few-shot CLIP with Adaptive Prior Refinement

Exploring the Adaptation Strategy of CLIP for Few-Shot Action Recognition

Rethinking Visual Content Refinement in Low-Shot CLIP Adaptation

Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification

Adversarial Domain Adaptation with CLIP for Few-Shot Image Classification

A Hard-to-Beat Baseline for Training-free CLIP-based Adaptation

CLAP4CLIP: Continual Learning with Probabilistic Finetuning for Vision-Language Models

Fine-Tuning for Few-shot Image Classification by Multimodal Prototype Regularization

CLIP Adaptation by Intra-modal Overlap Reduction

Robust Fine-Tuning of Vision-Language Models for Domain Generalization

Fully Fine-tuned CLIP Models are Efficient Few-Shot Learners

Adaptive Prompt Tuning: Vision Guided Prompt Tuning with Cross-Attention for Fine-Grained Few-Shot Learning

CLIP-Adapter: Better Vision-Language Models with Feature Adapters

A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models

Fine-tuned CLIP Models are Efficient Video Learners

Meta-Adapter: An Online Few-shot Learner for Vision-Language Model

Ta-Adapter: Enhancing few-shot CLIP with task-aware encoders