Abstract: The canonical approach to video action recognition has a neural network perform a classic 1-of-N majority-vote task. Such models are trained to predict a fixed set of predefined categories, which limits their transferability to new datasets with unseen concepts. In this article, we provide a new perspective on action recognition by attaching importance to the semantic information of label texts rather than simply mapping them to numbers. Specifically, we model the task as a video-text matching problem within a multimodal learning framework, which strengthens the video representation with richer semantic language supervision and enables our model to perform zero-shot action recognition without any further labeled data or additional parameters. Moreover, to handle the deficiency of label texts and make use of the tremendous amount of web data, we propose a new paradigm based on this multimodal learning framework for action recognition, which we dub "pre-train, adapt, and fine-tune." This paradigm first learns powerful representations by pre-training on a large amount of web image-text or video-text data. Then, it makes the action recognition task behave more like the pre-training problem via adaptation engineering. Finally, the model is fine-tuned end-to-end on target datasets to obtain strong performance. We give an instantiation of the new paradigm, ActionCLIP, which not only has superior and flexible zero-shot/few-shot transfer ability but also reaches top performance on the general action recognition task, achieving 83.8% top-1 accuracy on Kinetics-400 with ViT-B/16 as the backbone. Code is available at https://github.com/sallymmx/ActionCLIP.git.
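
To make the video-text matching formulation concrete, the following is a minimal sketch and not the ActionCLIP implementation. It assumes a video encoder and a text encoder have already produced embeddings. The toy vectors, the prompt wording, the function names, and the temperature value of 0.07 are all illustrative assumptions. Classification reduces to picking the label text whose embedding is most similar to the video embedding, so new labels can be added without new parameters.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def match_video_to_labels(video_emb, label_embs, temperature=0.07):
    """Return softmax probabilities over label texts for one video embedding.

    temperature is an assumed value chosen for illustration only.
    """
    v = l2_normalize(video_emb)
    logits = []
    for text_emb in label_embs:
        t = l2_normalize(text_emb)
        cosine = sum(a * b for a, b in zip(v, t))
        logits.append(cosine / temperature)
    # Numerically stable softmax over the similarity logits.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical label prompts and made-up embeddings (real ones would come
# from pre-trained text and video encoders).
labels = [
    "a video of a person playing guitar",
    "a video of a person swimming",
]
label_embs = [[1.0, 0.0, 0.2], [0.0, 1.0, 0.1]]
video_emb = [0.9, 0.1, 0.3]

probs = match_video_to_labels(video_emb, label_embs)
best = labels[probs.index(max(probs))]
print(best)
```

Because the label set enters only through its text embeddings, swapping in a different list of label prompts is enough to score unseen categories in this sketch, which mirrors the zero-shot setting described above.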

An Analysis of Action Recognition Datasets for Language and Vision Tasks

How to Improve Video Analytics with Action Recognition: A Survey

ActionCLIP: Adapting Language-Image Pretrained Models for Video Action Recognition.

View-invariant action recognition: a survey

Spatio-temporal Action Recognition: A Survey

Human Behavior Analysis: A Survey on Action Recognition

A Comprehensive Study of Deep Video Action Recognition

A survey on vision-based human action recognition

Human Action Recognition: A Taxonomy-Based Survey, Updates, and Opportunities

RGB-D-based Action Recognition Datasets: A Survey

Advances in Human Action Recognition: A Survey

A Comprehensive Survey of Vision-Based Human Action Recognition Methods

Reasoning about Actions over Visual and Linguistic Modalities: A Survey

Are Visual-Language Models Effective in Action Recognition? A Comparative Study

A Survey on Human Action Recognition

Collecting and Annotating the Large Continuous Action Dataset

Action Recognition by Exploring Data Distribution and Feature Correlation

Action Understanding with Multiple Classes of Actors

Are current long-term video understanding datasets long-term?

A Large-scale Varying-view RGB-D Action Dataset for Arbitrary-view Human Action Recognition

A survey on still image based human action recognition