Abstract: Learning generalized representations from limited training samples is crucial for applying deep neural networks in low-resource scenarios. Recently, methods based on contrastive language-image pretraining (CLIP) have exhibited promising performance in few-shot adaptation tasks. To avoid the catastrophic forgetting and overfitting caused by few-shot fine-tuning, existing works usually freeze the parameters of CLIP pretrained on large-scale datasets, overlooking the possibility that some parameters might not be suitable for downstream tasks. To this end, we revisit CLIP's visual encoder with a specific focus on its distinctive attention pooling layer, which performs a spatial weighted sum of the dense feature maps. Dense feature maps contain meaningful semantic information, and different semantics hold varying importance for different downstream tasks (for example, a pet classification task should prioritize semantics such as ears and eyes rather than side mirrors), so applying the same weighted-sum operation to dense features across different few-shot tasks might not be appropriate. Hence, we propose fine-tuning the parameters of the attention pooling layer during training to encourage the model to focus on task-specific semantics. During inference, we perform residual blending between the features pooled by the fine-tuned and the original attention pooling layers to incorporate both the few-shot knowledge and the pretrained CLIP's prior knowledge. We term this method semantic-aware fine-tuning. It is effective in enhancing the conventional few-shot CLIP and is compatible with the existing adapter approach. Extensive experiments on 11 benchmarks demonstrate that both the method and its adapter-combined variant significantly outperform the second-best method, by +1.51% and +2.38% in the one-shot setting and by +0.48% and +1.37% in the four-shot setting, respectively.
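The pooling and blending steps described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a simplified single-head attention pooling whose query is the mean of the dense features, and it assumes the residual blend is a convex combination controlled by a hypothetical coefficient `alpha`. The function names, shapes, and the randomly perturbed "fine-tuned" weights are all illustrative assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_pool(dense, w_q, w_k, w_v):
    """Simplified single-head attention pooling (an assumption, not CLIP's exact layer).

    dense: (N, D) dense feature map flattened over N spatial positions.
    Returns the pooled (D,) feature and the (N,) spatial weights.
    """
    query = dense.mean(axis=0) @ w_q      # global query built from the mean feature
    keys = dense @ w_k                    # (N, D)
    values = dense @ w_v                  # (N, D)
    weights = softmax(keys @ query / np.sqrt(dense.shape[1]))  # (N,), sums to 1
    return weights @ values, weights      # spatial weighted sum of the values


def residual_blend(f_tuned, f_orig, alpha=0.5):
    """Blend fine-tuned and original pooled features; the form and alpha are assumed."""
    return alpha * f_tuned + (1.0 - alpha) * f_orig


rng = np.random.default_rng(0)
N, D = 49, 8                                            # e.g. a 7x7 map with 8 channels
dense = rng.normal(size=(N, D))
orig = [rng.normal(size=(D, D)) for _ in range(3)]      # stand-in for frozen pretrained weights
tuned = [w + 0.01 * rng.normal(size=(D, D)) for w in orig]  # stand-in for a fine-tuned copy

f_orig, w_orig = attention_pool(dense, *orig)
f_tuned, w_tuned = attention_pool(dense, *tuned)
f_final = residual_blend(f_tuned, f_orig, alpha=0.5)
```

In this sketch only the copy of the pooling weights would be trained on the few-shot data, while the original weights stay frozen so that the blend can retain the pretrained prior.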

Region Attention Fine-tuning with CLIP for Few-shot Classification

Enhancing Few-Shot CLIP With Semantic-Aware Fine-Tuning

Not All Features Matter: Enhancing Few-shot CLIP with Adaptive Prior Refinement

Focus Your Attention when Few-Shot Classification

Fully Fine-tuned CLIP Models are Efficient Few-Shot Learners

Learning to Adapt Category Consistent Meta-Feature of CLIP for Few-Shot Classification

Image Classification with Frequency Channel Attention under the Few-Shot Condition

Object-aware Long-short-range Spatial Alignment for Few-Shot Fine-Grained Image Classification

Attention-Based Contrastive Learning for Few-Shot Remote Sensing Image Classification

Reinforced Attention for Few-Shot Learning and Beyond

Rethinking Prior Information Generation with CLIP for Few-Shot Segmentation

Cross Attention Network for Few-shot Classification

Boosting Few-Shot Segmentation via Instance-Aware Data Augmentation and Local Consensus Guided Cross Attention

Rethinking Visual Content Refinement in Low-Shot CLIP Adaptation

Fine-Tuning CLIP's Last Visual Projector: A Few-Shot Cornucopia

Task-wise Attention Guided Part Complementary Learning for Few-Shot Image Classification

Learning more discriminative local descriptors with parameter-free weighted attention for few-shot learning

A Closer Look at Few-Shot Video Classification: A New Baseline and Benchmark

CALIP: Zero-Shot Enhancement of CLIP with Parameter-free Attention

TSCLIP: Robust CLIP Fine-Tuning for Worldwide Cross-Regional Traffic Sign Recognition