Abstract: Most efforts to improve few-shot action recognition are dedicated to designing sophisticated temporal alignment algorithms. However, these works all rely heavily on the prior knowledge within a pre-trained model. Recently, CLIP (Contrastive Language-Image Pre-Training) has shown significant few-shot learning capability in various downstream tasks. Existing works fine-tune CLIP directly on the novel classes without considering the potential use of adequately labeled base class data. In this work, we conduct a thorough exploration of the adaptation strategies of CLIP for few-shot action recognition. Our findings reveal that even with a large-scale pre-trained model such as CLIP, it remains necessary to use sufficient base class data, if available, to fine-tune the model rather than fine-tuning directly on the novel classes. Moreover, we compare two classical adaptation algorithms proposed to address few-shot learning: Meta-learning and Finetuning. (For clarity, in this paper we use "Finetuning" to denote a specific adaptation strategy different from "Meta-learning", and use "fine-tuning", which is non-italicized and hyphenated, to denote modifying the parameters of a pre-trained model.) Our results indicate that Meta-learning is the better method for unlocking the generalization potential of CLIP. Additionally, we propose an overlooked, simple but efficient fine-tuning method: partial fine-tuning, which fine-tunes only the last layer of the backbone. It requires fewer learnable parameters and less computational cost than full fine-tuning or fine-tuning newly introduced adapter modules. Extensive experiments on the HMDB51, UCF101, and Kinetics datasets consistently demonstrate the superior generalization ability of our method, which achieves new state-of-the-art results in few-shot action recognition.
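The partial fine-tuning idea described in the abstract (updating only the last layer of the backbone) can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: `toy_backbone` is a hypothetical stand-in built from four small linear blocks rather than CLIP's visual encoder, and the helper names `freeze_all_but_last_block` and `count_trainable` are assumptions introduced here for illustration.

```python
import torch
from torch import nn


def freeze_all_but_last_block(backbone: nn.Sequential) -> nn.Sequential:
    """Freeze every parameter, then re-enable gradients for the final block only."""
    for param in backbone.parameters():
        param.requires_grad = False
    for param in backbone[-1].parameters():
        param.requires_grad = True
    return backbone


def count_trainable(module: nn.Module) -> int:
    """Count the parameters that will receive gradient updates."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


# Hypothetical stand-in for a pre-trained backbone: four identical blocks.
# (Assumption: a real setup would load CLIP's encoder here instead.)
toy_backbone = nn.Sequential(*[nn.Linear(16, 16) for _ in range(4)])
total_params = sum(p.numel() for p in toy_backbone.parameters())

freeze_all_but_last_block(toy_backbone)
trainable_params = count_trainable(toy_backbone)

# Only the unfrozen parameters are handed to the optimizer.
optimizer = torch.optim.SGD(
    [p for p in toy_backbone.parameters() if p.requires_grad], lr=1e-3
)

print(f"trainable: {trainable_params} of {total_params} parameters")
```

In this toy setting only one of the four blocks is updated, which mirrors the abstract's point that partial fine-tuning needs fewer learnable parameters than full fine-tuning; the actual ratio for CLIP depends on its architecture and is not given here.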

MA-FSAR: Multimodal Adaptation of CLIP for Few-Shot Action Recognition

ActionCLIP: Adapting Language-Image Pretrained Models for Video Action Recognition

CLIP-guided Prototype Modulating for Few-shot Action Recognition

Revisiting the Spatial and Temporal Modeling for Few-shot Action Recognition

Few-shot Action Recognition with Captioning Foundation Models

M2-CLIP: A Multimodal, Multi-task Adapting Framework for Video Action Recognition

Skeleton-Based Few-Shot Action Recognition via Fine-Grained Information Capture and Adaptive Metric Aggregation

Adapting CLIP for Action Recognition via Dual Semantic Supervision and Temporal Prompt Reparameterization

Frame Order Matters: A Temporal Sequence-Aware Model for Few-Shot Action Recognition

Exploring the Adaptation Strategy of CLIP for Few-Shot Action Recognition

Building a Multi-modal Spatiotemporal Expert for Zero-shot Action Recognition with CLIP

FineCLIPER: Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs

TAMT: Temporal-Aware Model Tuning for Cross-Domain Few-Shot Action Recognition

MVP-Shot: Multi-Velocity Progressive-Alignment Framework for Few-Shot Action Recognition

Enhancing Few-shot CLIP with Semantic-Aware Fine-Tuning

Motion-modulated Temporal Fragment Alignment Network for Few-Shot Action Recognition

SeFAR: Semi-supervised Fine-grained Action Recognition with Temporal Perturbation and Learning Stabilization

Semantic-aware Video Representation for Few-shot Action Recognition

B2C-AFM: Bi-Directional Co-Temporal and Cross-Spatial Attention Fusion Model for Human Action Recognition

Few-Shot Model Adaptation for Customized Facial Landmark Detection, Segmentation, Stylization and Shadow Removal