Abstract: The canonical approach to video action recognition has a neural model perform a classic, standard 1-of-N majority vote task. Such models are trained to predict a fixed set of predefined categories, which limits their transferability to new datasets with unseen concepts. In this paper, we provide a new perspective on action recognition by attaching importance to the semantic information of label texts rather than simply mapping them to numbers. Specifically, we model this task as a video-text matching problem within a multimodal learning framework, which strengthens the video representation with more semantic language supervision and enables our model to do zero-shot action recognition without requiring any further labeled data or parameters. Moreover, to handle the deficiency of label texts and make use of tremendous web data, we propose a new paradigm based on this multimodal learning framework for action recognition, which we dub "pre-train, prompt and fine-tune". This paradigm first learns powerful representations by pre-training on a large amount of web image-text or video-text data. Then it makes the action recognition task act more like the pre-training problem via prompt engineering. Finally, it fine-tunes end-to-end on target datasets to obtain strong performance. We give an instantiation of the new paradigm, ActionCLIP, which not only has superior and flexible zero-shot/few-shot transfer ability but also reaches top performance on the general action recognition task, achieving 83.8% top-1 accuracy on Kinetics-400 with a ViT-B/16 backbone. Code is available at https://github.com/sallymmx/ActionCLIP.git
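
The video-text matching idea in the abstract can be illustrated with a minimal sketch. It is not the ActionCLIP implementation: the prompt template, the mean pooling over frames, the temperature value, and the toy embeddings are all assumptions made for illustration, and in practice the embeddings would come from pre-trained video and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def mean_pool(frame_embeddings):
    """Average per-frame embeddings into one video embedding (assumed pooling choice)."""
    n = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    return [sum(frame[d] for frame in frame_embeddings) / n for d in range(dim)]


def build_prompts(labels, template="a video of a person {}"):
    """Turn label names into sentences; the template is a hypothetical example."""
    return [template.format(label) for label in labels]


def match_video_to_labels(frame_embeddings, text_embeddings, temperature=0.01):
    """Return a softmax over cosine similarities between the video and each label text."""
    video = l2_normalize(mean_pool(frame_embeddings))
    logits = []
    for text in text_embeddings:
        text = l2_normalize(text)
        logits.append(sum(v * t for v, t in zip(video, text)) / temperature)
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy data standing in for encoder outputs (made up for illustration only).
labels = ["playing guitar", "riding a bike", "swimming"]
prompts = build_prompts(labels)
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
frame_embeddings = [[0.1, 0.9, 0.0], [0.2, 0.8, 0.1]]

probs = match_video_to_labels(frame_embeddings, text_embeddings)
predicted = labels[max(range(len(probs)), key=probs.__getitem__)]
print(predicted)
```

Because classification reduces to comparing a video embedding with text embeddings, a label set unseen during training can be handled by writing new prompts, with no new classifier weights; this is the property that makes zero-shot recognition possible under this formulation.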

Category-Specific Prompts for Animal Action Recognition with Pretrained Vision-Language Models

ActionCLIP: Adapting Language-Image Pretrained Models for Video Action Recognition.

Knowledge-guided Pre-Training and Fine-Tuning: Video Representation Learning for Action Recognition

Generating Action-conditioned Prompts for Open-vocabulary Video Action Recognition

CLAMP: Prompt-based Contrastive Learning for Connecting Language and Animal Pose

Adapting CLIP for Action Recognition via Dual Semantic Supervision and Temporal Prompt Reparameterization

Rethinking CLIP-based Video Learners in Cross-Domain Open-Vocabulary Action Recognition

Multi-Modal Prompting for Open-Vocabulary Video Visual Relationship Detection

Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models

Multi-modal Attribute Prompting for Vision-Language Models

GBC: Guided Alignment and Adaptive Boosting CLIP Bridging Vision and Language for Robust Action Recognition

Spatio-Temporal Context Prompting for Zero-Shot Action Detection

PromptPose: Language Prompt Helps Animal Pose Estimation.

M2-CLIP: A Multimodal, Multi-task Adapting Framework for Video Action Recognition

ActionCLIP: A New Paradigm for Video Action Recognition

Data-free Multi-label Image Recognition via LLM-powered Prompt Tuning

Interaction-Aware Prompting for Zero-Shot Spatio-Temporal Action Detection

Commonsense Knowledge Prompting for Few-shot Action Recognition in Videos

Understanding the Multi-modal Prompts of the Pre-trained Vision-Language Model

ActPrompt: In-Domain Feature Adaptation via Action Cues for Video Temporal Grounding

Prompting Visual-Language Models for Efficient Video Understanding