Abstract: Learning generalized representations from limited training samples is crucial for applying deep neural networks in low-resource scenarios. Recently, methods based on contrastive language-image pretraining (CLIP) have exhibited promising performance in few-shot adaptation tasks. To avoid the catastrophic forgetting and overfitting caused by few-shot fine-tuning, existing works usually freeze the parameters of CLIP pretrained on large-scale datasets, overlooking the possibility that some parameters might not be suitable for downstream tasks. To this end, we revisit CLIP's visual encoder with a specific focus on its distinctive attention pooling layer, which performs a spatial weighted sum of the dense feature maps. Dense feature maps contain meaningful semantic information, and different semantics hold varying importance for different downstream tasks (for example, pet classification should prioritize semantics such as ears and eyes rather than side mirrors), so applying the same weighted-sum operation to dense features across different few-shot tasks might not be appropriate. Hence, we propose fine-tuning the parameters of the attention pooling layer during training to encourage the model to focus on task-specific semantics. At inference, we perform residual blending between the features pooled by the fine-tuned and the original attention pooling layers to incorporate both the few-shot knowledge and the pretrained CLIP's prior knowledge. We term this method semantic-aware fine-tuning. It is effective in enhancing the conventional few-shot CLIP and is compatible with the existing adapter approach. Extensive experiments on 11 benchmarks demonstrate that both semantic-aware fine-tuning and its adapter-compatible variant significantly outperform the second-best method, by +1.51% and +2.38% in the one-shot setting and by +0.48% and +1.37% in the four-shot setting, respectively.
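
The two mechanisms described in the abstract, attention pooling as a spatial weighted sum and residual blending at inference, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single-head pooling, the mean-token query, the blending coefficient `alpha`, the toy dimensions, and the randomly perturbed "fine-tuned" weights are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(dense, w_q, w_k):
    # Assumed single-head form: the mean token acts as the query and
    # every spatial token is a key. Returns one weight per location.
    query = dense.mean(axis=0) @ w_q                # (D,)
    keys = dense @ w_k                              # (N, D)
    return softmax(keys @ query / np.sqrt(dense.shape[1]))  # (N,)

def attention_pool(dense, w_q, w_k, w_v):
    # Spatial weighted sum of the projected dense features: (N, D) -> (D,)
    weights = attention_weights(dense, w_q, w_k)
    return weights @ (dense @ w_v)

def residual_blend(feat_tuned, feat_orig, alpha=0.5):
    # Hypothetical blending rule: alpha weights the fine-tuned branch,
    # (1 - alpha) weights the frozen pretrained branch.
    return alpha * feat_tuned + (1.0 - alpha) * feat_orig

rng = np.random.default_rng(0)
N, D = 49, 16                                       # toy 7x7 feature map, 16 channels
dense = rng.normal(size=(N, D))

# Frozen "pretrained" pooling parameters and a perturbed copy standing in
# for the parameters after few-shot fine-tuning (illustrative only).
orig_params = [0.1 * rng.normal(size=(D, D)) for _ in range(3)]
tuned_params = [p + 0.01 * rng.normal(size=p.shape) for p in orig_params]

feat_orig = attention_pool(dense, *orig_params)
feat_tuned = attention_pool(dense, *tuned_params)
blended = residual_blend(feat_tuned, feat_orig, alpha=0.5)
print(blended.shape)
```

In this sketch, `alpha = 0` recovers the pretrained feature and `alpha = 1` uses only the fine-tuned pooling, so the coefficient trades few-shot knowledge against the pretrained prior.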

Few-Shot Adaptation for Multimedia Semantic Indexing

Instance-Level Embedding Adaptation for Few-Shot Learning

Adaptive Cross-Modal Few-Shot Learning

Simple Semantic-Aided Few-Shot Learning

Less is More: A Closer Look at Semantic-based Few-Shot Learning

Transductive Episodic-Wise Adaptive Metric for Few-Shot Learning

Iterative Few-shot Semantic Segmentation from Image Label Text

Semantic-Based Few-Shot Learning by Interactive Psychometric Testing

Few-Shot Learning Based on Deep Learning for Image Classification

Few-shot Adaptation of Multi-modal Foundation Models: A Survey

Discriminative Sample-Guided and Parameter-Efficient Feature Space Adaptation for Cross-Domain Few-Shot Learning

EFSA: Episodic Few-Shot Adaptation for Text-to-Image Retrieval

Semantic Prompt for Few-Shot Image Recognition

Semantic-Based Implicit Feature Transform for Few-Shot Classification

Enhancing Few-Shot CLIP With Semantic-Aware Fine-Tuning

A Novel Benchmark for Few-Shot Semantic Segmentation in the Era of Foundation Models

Few-Shot Segmentation Without Meta-Learning: A Good Transductive Inference Is All You Need?

VSA: Adaptive Visual and Semantic Guided Attention on Few-Shot Learning

Learning Embedding Adaptation for Few-Shot Learning

Few-Shot Learning via Embedding Adaptation With Set-to-Set Functions