Abstract: Few-shot action recognition (FSAR) aims to learn a model capable of identifying novel actions in videos using only a few examples. By assuming that the base dataset seen during meta-training and the novel dataset used for evaluation can come from different domains, cross-domain few-shot learning alleviates the data collection and annotation costs required by methods with greater supervision and by conventional (single-domain) few-shot methods. While this form of learning has been extensively studied for image classification, studies in cross-domain FSAR (CD-FSAR) are limited to proposing a model, rather than first understanding the cross-domain capabilities of existing models. To this end, we systematically evaluate existing state-of-the-art single-domain, transfer-based, and cross-domain FSAR methods on new cross-domain tasks of increasing difficulty, measured by the domain shift between the base and novel sets. Our empirical meta-analysis reveals a correlation between domain difference and downstream few-shot performance, and uncovers several important insights into which model aspects are effective for CD-FSAR and which need further development. Namely, we find that as the domain difference increases, the simple transfer-learning approach outperforms other methods by over 12 percentage points, and that under these more challenging cross-domain settings, the specialised cross-domain model achieves the lowest performance. We also observe that state-of-the-art single-domain FSAR models that use temporal alignment achieve similar or worse performance than earlier methods that do not, suggesting that existing temporal alignment techniques fail to generalise to unseen domains. To the best of our knowledge, we are the first to systematically study the CD-FSAR problem in depth. We hope the insights and challenges revealed in our study inspire and inform future work in these directions.

KDM: A knowledge-guided and data-driven method for few-shot video action recognition

Knowledge-guided Pre-Training and Fine-Tuning: Video Representation Learning for Action Recognition

TFRS: A task-level feature rectification and separation method for few-shot video action recognition

Revisiting the Spatial and Temporal Modeling for Few-shot Action Recognition

Understanding the Cross-Domain Capabilities of Video-Based Few-Shot Action Recognition Models

MVP-Shot: Multi-Velocity Progressive-Alignment Framework for Few-Shot Action Recognition

Depth Guided Adaptive Meta-Fusion Network for Few-shot Video Recognition

Multi-directional Knowledge Transfer for Few-Shot Learning

Video Action Recognition with Attentive Semantic Units

Semantic-aware Video Representation for Few-shot Action Recognition

Knowledge Graph Enhanced Multimodal Learning for Few-shot Visual Recognition

Knowledge-Based Fine-Grained Classification for Few-Shot Learning

Exploring Few-Shot Adaptation for Activity Recognition on Diverse Domains

Knowledge Prompting for Few-shot Action Recognition

On the Importance of Spatial Relations for Few-shot Action Recognition

Exploiting spatio‐temporal knowledge for video action recognition

Few-shot Action Recognition via Intra- and Inter-Video Information Maximization

Few-shot action recognition with implicit temporal alignment and pair similarity optimization

Frame Order Matters: A Temporal Sequence-Aware Model for Few-Shot Action Recognition

A dual-prototype network combining query-specific and class-specific attentive learning for few-shot action recognition

Fusion Attention for Action Recognition: Integrating Sparse-Dense and Global Attention for Video Action Recognition