Abstract: Learning generalized representations from limited training samples is crucial for applying deep neural networks in low-resource scenarios. Recently, methods based on contrastive language-image pretraining (CLIP) have shown promising performance in few-shot adaptation tasks. To avoid the catastrophic forgetting and overfitting caused by few-shot fine-tuning, existing works usually freeze the parameters of CLIP pretrained on large-scale datasets, overlooking the possibility that some parameters might not be suitable for downstream tasks. To this end, we revisit CLIP's visual encoder with a specific focus on its distinctive attention pooling layer, which performs a spatial weighted sum of the dense feature maps. Dense feature maps contain meaningful semantic information, and different semantics hold varying importance for different downstream tasks (for example, a pet classification task should prioritize semantics such as ears and eyes rather than side mirrors). Using the same weighted-sum operation over dense features across different few-shot tasks might therefore not be appropriate. Hence, we propose fine-tuning the parameters of the attention pooling layer during training to encourage the model to focus on task-specific semantics. At inference, we perform residual blending between the features pooled by the fine-tuned and the original attention pooling layers to incorporate both the few-shot knowledge and the pretrained CLIP's prior knowledge. We term this method semantic-aware fine-tuning. It is effective in enhancing the conventional few-shot CLIP and is compatible with the existing adapter approach. Extensive experiments on 11 benchmarks demonstrate that both the method and its adapter-compatible variant significantly outperform the second-best method, by +1.51% and +2.38% in the one-shot setting and by +0.48% and +1.37% in the four-shot setting, respectively.
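The pooling and blending steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the single-head formulation, the use of the mean feature as the query, the linear blending rule with coefficient `alpha`, the value 0.5, and the random weights (with a small perturbation standing in for fine-tuning) are all assumptions made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def attention_pool(dense, w_q, w_k, w_v):
    """Single-head attention pooling over spatial positions.

    dense: list of N feature vectors, one per spatial location.
    Assumption: the query is a projection of the mean feature.
    Returns the pooled feature and the spatial attention weights.
    """
    dim = len(dense[0])
    mean_feat = [sum(f[i] for f in dense) / len(dense) for i in range(dim)]
    query = matvec(w_q, mean_feat)
    keys = [matvec(w_k, f) for f in dense]
    values = [matvec(w_v, f) for f in dense]
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # Spatial weighted sum of the value vectors.
    pooled = [
        sum(w * v[i] for w, v in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return pooled, weights


def residual_blend(tuned, original, alpha):
    """Hypothetical blending rule: alpha * tuned + (1 - alpha) * original."""
    return [alpha * t + (1.0 - alpha) * o for t, o in zip(tuned, original)]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def perturb(matrix, rng, std=0.1):
    """Stand-in for fine-tuning: add small Gaussian noise to each weight."""
    return [[w + rng.gauss(0.0, std) for w in row] for row in matrix]


rng = random.Random(0)
dense = random_matrix(6, 4, rng)  # 6 spatial positions, feature dim 4

# "Original" (frozen) pooling parameters and a perturbed "fine-tuned" copy.
orig_params = [random_matrix(4, 4, rng) for _ in range(3)]
tuned_params = [perturb(m, rng) for m in orig_params]

pooled_orig, weights_orig = attention_pool(dense, *orig_params)
pooled_tuned, weights_tuned = attention_pool(dense, *tuned_params)

alpha = 0.5  # assumed blending coefficient, not taken from the paper
blended = residual_blend(pooled_tuned, pooled_orig, alpha)
print(blended)
```

With `alpha = 0` the blend reduces to the original pooled feature, and with `alpha = 1` it uses only the fine-tuned pooling; intermediate values trade off the pretrained prior against the few-shot knowledge.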
