Abstract: The canonical approach to video action recognition requires a neural model to perform a classic, standard 1-of-N majority vote task. Such models are trained to predict a fixed set of predefined categories, which limits their ability to transfer to new datasets with unseen concepts. In this paper, we provide a new perspective on action recognition by attaching importance to the semantic information of label texts rather than simply mapping them to numbers. Specifically, we model this task as a video-text matching problem within a multimodal learning framework, which strengthens the video representation with more semantic language supervision and enables our model to perform zero-shot action recognition without any further labeled data or additional parameters. Moreover, to handle the deficiency of label texts and make use of the tremendous amount of web data, we propose a new paradigm based on this multimodal learning framework for action recognition, which we dub "pre-train, prompt and fine-tune". This paradigm first learns powerful representations by pre-training on a large amount of web image-text or video-text data. It then makes the action recognition task act more like the pre-training problem via prompt engineering. Finally, it fine-tunes end-to-end on target datasets to obtain strong performance. We give an instantiation of the new paradigm, ActionCLIP, which not only has superior and flexible zero-shot/few-shot transfer ability but also reaches top performance on the general action recognition task, achieving 83.8% top-1 accuracy on Kinetics-400 with a ViT-B/16 backbone. Code is available at https://github.com/sallymmx/ActionCLIP.git
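The video-text matching idea described in the abstract can be illustrated with a minimal sketch. This is not the ActionCLIP implementation: the prompt template, the mean pooling over frames, the temperature value, and the function names `build_prompts` and `match_video_to_labels` are all assumptions made for illustration, and random arrays stand in for the outputs of real video and text encoders.

```python
import numpy as np


def build_prompts(labels, template="a video of a person {}"):
    """Turn class names into sentences (the template is a hypothetical example)."""
    return [template.format(label) for label in labels]


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def match_video_to_labels(frame_features, text_features, temperature=0.01):
    """Score one video against every label text.

    frame_features: (num_frames, dim) array of per-frame embeddings.
    text_features:  (num_labels, dim) array of prompt embeddings.
    Returns a (num_labels,) array of probabilities.
    """
    # Assumption: mean pooling is used as the simplest temporal aggregation.
    video = l2_normalize(frame_features.mean(axis=0))
    text = l2_normalize(text_features)
    # Cosine similarity between the video and each label text, scaled by temperature.
    logits = text @ video / temperature
    logits = logits - logits.max()  # subtract the max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()


# Stand-in data: a real system would obtain these from pre-trained encoders.
rng = np.random.default_rng(0)
labels = ["typing", "running", "swimming"]
prompts = build_prompts(labels)
text_features = rng.normal(size=(len(prompts), 8))
frame_features = rng.normal(size=(4, 8))

probs = match_video_to_labels(frame_features, text_features)
print(prompts[int(np.argmax(probs))])
```

Because the label set enters only through the text features, swapping in a new list of labels changes the classifier without adding parameters, which is the property that makes zero-shot recognition possible in this formulation.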

Action Machine: Rethinking Action Recognition in Trimmed Videos

[Figure: pipeline from video frames after person detection through pose estimation, pose tube, 2D deconvolution, RGB action recognition, and pose action recognition to score fusion; example output label "Typing".]

Action Machine: Toward Person-Centric Action Recognition in Videos

ActionCLIP: Adapting Language-Image Pretrained Models for Video Action Recognition

Online Robust Action Recognition Based on a Hierarchical Model

Human Action Recognition Using Deep Learning Methods

Human Action Recognition with Contextual Constraints Using a RGB-D Sensor

B2C-AFM: Bi-Directional Co-Temporal and Cross-Spatial Attention Fusion Model for Human Action Recognition

Shifting Perspective to See Difference: A Novel Multi-View Method for Skeleton Based Action Recognition

Annotation-Efficient Untrimmed Video Action Recognition

An Approach to Pose-Based Action Recognition

Part-level Action Parsing Via a Pose-guided Coarse-to-Fine Framework

Residual Frames with Efficient Pseudo-3D CNN for Human Action Recognition

Action Recognition in RGB-D Egocentric Videos

Efficient Action Detection in Untrimmed Videos via Multi-Task Learning

Joint Dynamic Pose Image and Space Time Reversal for Human Action Recognition from Videos

ActionCLIP: A New Paradigm for Video Action Recognition

Skeleton-Indexed Deep Multi-Modal Feature Learning for High Performance Human Action Recognition

Action Recognition by Exploring Data Distribution and Feature Correlation

Deep Convolutional Neural Networks for Action Recognition Using Depth Map Sequences

Empowering Efficient Spatio-Temporal Learning with a 3D CNN for Pose-Based Action Recognition