Fine-Tuning CLIP's Last Visual Projector: A Few-Shot Cornucopia

Mohammad Fahes, Tuan-Hung Vu, Andrei Bursuc, Patrick Pérez, Raoul de Charette
2024-10-08
Abstract: We consider the problem of adapting a contrastively pretrained vision-language model like CLIP (Radford et al., 2021) for few-shot classification. The existing literature addresses this problem by learning a linear classifier on the frozen visual features, optimizing word embeddings, or learning external feature adapters. This paper introduces an alternative way to adapt CLIP without adding 'external' parameters to optimize. We find that simply fine-tuning the last projection matrix of the vision encoder leads to strong performance compared to the existing baselines. Furthermore, we show that regularizing training with the distance between the fine-tuned and pretrained matrices adds reliability for adapting CLIP through this layer. Perhaps surprisingly, this approach, coined ProLIP, performs on par with or better than the state of the art on 11 few-shot classification benchmarks, few-shot domain generalization, cross-dataset transfer and test-time adaptation. Code will be made available at https://github.com/astra-vision/ProLIP.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses how to adapt the CLIP model for classification when only a small number of labeled samples is available. It proposes a simple method, ProLIP (Projection Layer Fine-tuning), which fine-tunes only the projection matrix of the last layer of the visual encoder. The paper points out that existing methods, such as learning linear classifiers, optimizing word embeddings, or learning external feature adapters, have limitations, including slow training and the need to design additional parameters. In contrast, ProLIP addresses these issues in the following ways:

1. **Simplified model parameters**: Only the projection matrix of the last layer is adjusted, without introducing additional parameters.
2. **Fast training**: Since only a single matrix is fine-tuned, training is fast.
3. **Utilizing text embeddings**: Text embeddings are used as classification weights during fine-tuning, consistent with the pre-training mechanism of CLIP.
4. **Regularization strategy**: A regularization term constrains the distance between the fine-tuned matrix and the pre-trained matrix to prevent overfitting.

Experimental results show that ProLIP performs well in few-shot classification on multiple datasets and is competitive in cross-dataset generalization and test-time adaptation. Additionally, the paper explores performance in a setting without a validation set and demonstrates how to select appropriate hyperparameters in few-shot scenarios.
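
The recipe summarized above (train one projection matrix, classify with frozen text embeddings, and penalize drift from the pretrained matrix) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes the distance is the squared Frobenius norm, uses plain gradient descent, replaces CLIP's learned temperature with a fixed `scale`, and runs on random toy data instead of real CLIP features. All function and parameter names (`loss_and_grad`, `fit_projection`, `make_toy_problem`, `lam`) are hypothetical.

```python
import numpy as np


def loss_and_grad(W, W0, feats, labels, text_emb, scale=10.0, lam=1.0):
    """Cross-entropy on cosine logits plus lam * ||W - W0||_F^2 (assumed form).

    feats:    (N, d_in) pre-projection visual features (frozen backbone output)
    text_emb: (K, d_out) unit-norm class text embeddings (frozen)
    Returns the scalar loss and its gradient with respect to W.
    """
    n = feats.shape[0]
    z = feats @ W                                    # projected features
    norm = np.linalg.norm(z, axis=1, keepdims=True)
    v = z / norm                                     # unit-norm image embeddings
    logits = scale * v @ text_emb.T                  # scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    ce = -np.log(p[np.arange(n), labels]).mean()
    reg = lam * np.sum((W - W0) ** 2)

    # Backward pass, written out by hand.
    dlogits = p.copy()
    dlogits[np.arange(n), labels] -= 1.0
    dlogits /= n
    dv = scale * dlogits @ text_emb
    dz = (dv - v * np.sum(dv * v, axis=1, keepdims=True)) / norm  # through L2-normalization
    grad = feats.T @ dz + 2.0 * lam * (W - W0)
    return ce + reg, grad


def fit_projection(W0, feats, labels, text_emb, steps=200, lr=1e-3, **kw):
    """Gradient descent on the projection matrix only; everything else is frozen."""
    W = W0.copy()
    history = []
    for _ in range(steps):
        loss, grad = loss_and_grad(W, W0, feats, labels, text_emb, **kw)
        history.append(loss)
        W -= lr * grad
    history.append(loss_and_grad(W, W0, feats, labels, text_emb, **kw)[0])
    return W, history


def make_toy_problem(seed=0, n_classes=3, shots=4, d_in=16, d_out=8):
    """Random stand-ins for CLIP features, text embeddings and the pretrained matrix."""
    rng = np.random.default_rng(seed)
    text_emb = rng.normal(size=(n_classes, d_out))
    text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)
    W0 = rng.normal(size=(d_in, d_out)) / np.sqrt(d_in)
    labels = np.repeat(np.arange(n_classes), shots)
    feats = rng.normal(size=(n_classes * shots, d_in))
    return W0, feats, labels, text_emb


if __name__ == "__main__":
    W0, feats, labels, text_emb = make_toy_problem()
    W, history = fit_projection(W0, feats, labels, text_emb, lam=1.0)
    print("initial loss:", history[0], "final loss:", history[-1])
    print("drift from pretrained matrix:", np.linalg.norm(W - W0))
```

In this sketch, `lam` controls how far the matrix may move from its pretrained value: setting it to zero recovers unregularized fine-tuning, while larger values keep `W` closer to `W0`.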