Abstract: Learning generalized representations from limited training samples is crucial for applying deep neural networks in low-resource scenarios. Recently, methods based on contrastive language-image pretraining (CLIP) have exhibited promising performance in few-shot adaptation tasks. To avoid the catastrophic forgetting and overfitting caused by few-shot fine-tuning, existing works usually freeze the parameters of CLIP pretrained on large-scale datasets, overlooking the possibility that some parameters might not be suitable for downstream tasks. To this end, we revisit CLIP's visual encoder with a specific focus on its distinctive attention pooling layer, which performs a spatial weighted sum of the dense feature maps. Dense feature maps contain meaningful semantic information, and different semantics hold varying importance for diverse downstream tasks (for example, a pet classification task should prioritize semantics such as ears and eyes rather than side mirrors), so using the same weighted-sum operation for dense features across different few-shot tasks might not be appropriate. Hence, we propose fine-tuning the parameters of the attention pooling layer during training to encourage the model to focus on task-specific semantics. At inference, we perform residual blending between the features pooled by the fine-tuned and the original attention pooling layers to incorporate both the few-shot knowledge and the pretrained CLIP's prior knowledge. We term this method semantic-aware fine-tuning. It is effective in enhancing the conventional few-shot CLIP and is compatible with the existing adapter approach. Extensive experiments on 11 benchmarks demonstrate that the method alone and its combination with the adapter approach significantly outperform the second-best method by +1.51% and +2.38% in the one-shot setting and by +0.48% and +1.37% in the four-shot setting, respectively.
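The pooling and blending steps described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: it assumes a single-head attention pooling in which a query vector stands in for the layer's learnable parameters, and it assumes the residual blend is a convex combination controlled by a hypothetical weight `alpha`. The function names `attention_pool` and `residual_blend` are illustrative only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(dense, query):
    """Spatial weighted sum of dense features.

    dense: list of N feature vectors (one per spatial position).
    query: vector standing in for the pooling layer's parameters
           (an assumption made for this sketch).
    """
    dim = len(query)
    scores = [
        sum(q * f for q, f in zip(query, feat)) / math.sqrt(dim)
        for feat in dense
    ]
    weights = softmax(scores)
    return [sum(w * feat[d] for w, feat in zip(weights, dense)) for d in range(dim)]


def residual_blend(tuned, original, alpha):
    """Assumed blending rule: alpha * tuned + (1 - alpha) * original."""
    return [alpha * t + (1.0 - alpha) * o for t, o in zip(tuned, original)]


# Toy dense feature map with three spatial positions and two channels.
dense = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

pretrained_query = [0.0, 0.0]  # frozen layer: uniform weights in this toy case
tuned_query = [4.0, 0.0]       # fine-tuned layer: favors the first channel

original_feat = attention_pool(dense, pretrained_query)
tuned_feat = attention_pool(dense, tuned_query)
blended = residual_blend(tuned_feat, original_feat, alpha=0.5)
```

In this toy setup the fine-tuned query shifts the pooling weights toward positions that activate the first channel, while the blended feature stays between the task-specific and the pretrained pooled features.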

Fast Parameter Adaptation for Few-shot Image Captioning and Visual Question Answering.

User-Aware Prefix-Tuning is a Good Learner for Personalized Image Captioning

Few-Shot Model Adaptation for Customized Facial Landmark Detection, Segmentation, Stylization and Shadow Removal

SgVA-CLIP: Semantic-Guided Visual Adapting of Vision-Language Models for Few-Shot Image Classification

Enhancing Few-Shot CLIP With Semantic-Aware Fine-Tuning

PACIA: Parameter-Efficient Adapter for Few-Shot Molecular Property Prediction

Hybrid Consistency Training with Prototype Adaptation for Few-Shot Learning

Few-shot Learner Parameterization by Diffusion Time-steps

Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification

KNN Transformer with Pyramid Prompts for Few-Shot Learning

Learning Embedding Adaptation for Few-Shot Learning

Ego-VPA: Egocentric Video Understanding with Parameter-efficient Adaptation

PFMNet: Prototype-based feature mapping network for few-shot domain adaptation in medical image segmentation

Few-Shot Learning Based on Deep Learning for Image Classification

Audio-Visual Generalized Few-Shot Learning with Prototype-Based Co-Adaptation

Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning

Few-Shot VQA with Frozen LLMs: A Tale of Two Approaches

MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting

Black Box Few-Shot Adaptation for Vision-Language models

Few-shot Action Recognition with Captioning Foundation Models