Abstract: One-shot action recognition aims to recognize actions from unseen classes when only one training video is provided per class. Compared with one-shot image recognition, one-shot learning on videos is more difficult because the temporal dimension of video can introduce greater variation. To handle this variation, further adaptation during one-shot training is important, despite the scarcity of training data. Meta-learning could facilitate this adaptation, but it cannot be applied directly, for two reasons. First, the high computational complexity of deep action recognition networks can make current meta-learning methods infeasible to run. Second, because of the greater variation in actions, the adapted performance may be no higher than the un-adapted performance, which makes the model difficult to train by meta-learning. To address these problems and facilitate adaptation, we propose the Adaptation-Oriented Feature (AOF) projection for one-shot action recognition. We first pre-train the base network on seen classes. The output of the network is then projected into the adaptation-oriented feature space by fusing the important feature dimensions that are sensitive to adaptation. Subsequently, a small dataset (a.k.a. a task) is sampled from the seen classes to simulate the unseen-class training and testing settings. Feature adaptation is performed on the training data of this task to integrate the distribution information of the adapted feature. To reduce over-fitting, a triplet loss is applied during adaptation to handle temporal variation with fewer parameters. On the testing data of this task, the losses on both the adapted and un-adapted features are calculated to train the projection matrix. This sampling-adaptation-training procedure is repeated on the seen classes until convergence. Extensive experimental results on two challenging one-shot action recognition datasets demonstrate that our proposed method outperforms state-of-the-art methods.
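The task-sampling step and the triplet loss mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`sample_task`, `triplet_loss`), the 5-way setting, the squared Euclidean distance, and the margin value are all assumptions chosen for illustration, and the abstract does not say how triplets are formed from video features.

```python
import random


def sample_task(videos_by_class, n_way=5, n_query=1, rng=None):
    """Sample a one-shot task from seen classes (hypothetical helper).

    Each sampled class contributes one training video and n_query
    testing videos, mimicking the unseen-class one-shot setting.
    """
    rng = rng or random.Random()
    classes = rng.sample(sorted(videos_by_class), n_way)
    train, test = [], []
    for label, cls in enumerate(classes):
        picked = rng.sample(videos_by_class[cls], 1 + n_query)
        train.append((picked[0], label))                # the single shot
        test.extend((v, label) for v in picked[1:])     # held-out videos
    return train, test


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss that pulls the positive closer to the anchor than the
    negative by at least `margin` (the margin value is an assumption)."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)
```

In a full training loop, a task would be sampled at each iteration, the features adapted on `train`, and the losses on `test` used to update the projection matrix; those steps depend on details the abstract does not give, so they are left out here.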

Adaptation-Oriented Feature Projection for One-Shot Action Recognition

ActionCLIP: Adapting Language-Image Pretrained Models for Video Action Recognition

Revisiting the Spatial and Temporal Modeling for Few-shot Action Recognition

Task-Adapter: Task-specific Adaptation of Image Models for Few-shot Action Recognition

A Pairwise Attentive Adversarial Spatiotemporal Network for Cross-Domain Few-Shot Action Recognition-R2

Part-aware Prototypical Graph Network for One-shot Skeleton-based Action Recognition

Spatial-Temporal Adaptive Metric Learning Network for One-Shot Skeleton-Based Action Recognition

Depth Guided Adaptive Meta-Fusion Network for Few-shot Video Recognition

Hierarchical Temporal Memory Enhanced One-Shot Distance Learning for Action Recognition

Task-Specific Alignment and Multiple Level Transformer for Few-Shot Action Recognition

Task-Aware Dual-Representation Network for Few-Shot Action Recognition

Few-shot Action Recognition with Prototype-centered Attentive Learning

Shifting Perspective to See Difference: A Novel Multi-View Method for Skeleton Based Action Recognition

Embodied One-Shot Video Recognition

Object-based (yet Class-agnostic) Video Domain Adaptation

MVP-Shot: Multi-Velocity Progressive-Alignment Framework for Few-Shot Action Recognition

Few-shot action recognition with implicit temporal alignment and pair similarity optimization

D$^2$ST-Adapter: Disentangled-and-Deformable Spatio-Temporal Adapter for Few-shot Action Recognition

Matching Compound Prototypes for Few-Shot Action Recognition

Enhancing Few-Shot Action Recognition Using Skeleton Temporal Alignment and Adversarial Training