Abstract: The canonical approach to video action recognition has a neural network perform a classic 1-of-N majority vote task. Such models are trained to predict a fixed set of predefined categories, which limits their transferability to new datasets with unseen concepts. In this article, we provide a new perspective on action recognition by attaching importance to the semantic information of label texts rather than simply mapping them to numbers. Specifically, we model this task as a video-text matching problem within a multimodal learning framework, which strengthens the video representation with more semantic language supervision and enables our model to do zero-shot action recognition without any further labeled data or additional parameters. Moreover, to handle the deficiency of label texts and make use of the tremendous amount of web data, we propose a new paradigm based on this multimodal learning framework for action recognition, which we dub "pre-train, adapt and fine-tune." This paradigm first learns powerful representations by pre-training on a large amount of web image-text or video-text data. Then, it makes the action recognition task resemble the pre-training problem via adaptation engineering. Finally, it is fine-tuned end-to-end on target datasets to obtain strong performance. We give an instantiation of the new paradigm, ActionCLIP, which not only has superior and flexible zero-shot/few-shot transfer ability but also reaches top performance on the general action recognition task, achieving 83.8% top-1 accuracy on Kinetics-400 with a ViT-B/16 backbone. Code is available at https://github.com/sallymmx/ActionCLIP.git.
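
The video-text matching formulation can be illustrated with a minimal sketch. It is not the ActionCLIP implementation: the embeddings below are toy vectors standing in for the outputs of pre-trained video and text encoders, the function names are hypothetical, mean pooling over frames is just one simple way to obtain a video-level vector, and the temperature of 0.07 is an assumed value. The point is only that classification becomes a similarity search over label texts, so adding an unseen class means adding a new text embedding rather than a new output neuron.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def mean_pool(frame_embeddings):
    """Average per-frame embeddings into a single video-level embedding."""
    n = len(frame_embeddings)
    return [sum(column) / n for column in zip(*frame_embeddings)]


def match_video_to_labels(frame_embeddings, label_embeddings, temperature=0.07):
    """Return a probability per label text from cosine similarity.

    frame_embeddings: list of per-frame vectors (hypothetical encoder output).
    label_embeddings: dict mapping label text -> vector (hypothetical encoder output).
    temperature: assumed softmax temperature, not taken from the source.
    """
    video = l2_normalize(mean_pool(frame_embeddings))
    sims = {
        label: sum(v * t for v, t in zip(video, l2_normalize(emb)))
        for label, emb in label_embeddings.items()
    }
    # Softmax over scaled similarities, shifted by the max for numerical stability.
    top = max(sims.values())
    exps = {label: math.exp((s - top) / temperature) for label, s in sims.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Toy data: two frames and two label prompts (all values are made up).
frames = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
labels = {
    "a video of a person playing guitar": [1.0, 0.0, 0.0],
    "a video of a person swimming": [0.0, 1.0, 0.0],
}
probs = match_video_to_labels(frames, labels)
prediction = max(probs, key=probs.get)
print(prediction)
```

Under these assumptions, zero-shot recognition of a new action only requires inserting another entry into `labels`; no classifier weights have to be retrained.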

Human Action Adverb Recognition: ADHA Dataset and a Three-Stream Hybrid Model

Human Action Recognition Using Deep Learning Methods.

ActionCLIP: Adapting Language-Image Pretrained Models for Video Action Recognition.

Human Action Recognition Based on Three-Stream Network with Frame Sequence Features

Further Understanding Videos through Adverbs: A New Video Task

B2C-AFM: Bi-Directional Co-Temporal and Cross-Spatial Attention Fusion Model for Human Action Recognition.

Action recognition using attention-based spatio-temporal VLAD networks and adaptive video sequences optimization

HabitAction: A Video Dataset for Human Habitual Behavior Recognition

ADD: Actionness-Pooled Deep-Convolutional Descriptor

Action Recognition Using Hybrid Feature Descriptor And Vlad Video Encoding

Actionness-pooled Deep-convolutional Descriptor for Fine-Grained Action Recognition.

A Survey on Human Action Recognition

HACS: Human Action Clips and Segments Dataset for Recognition and Temporal Localization

Human action recognition using Adaptive Hierarchical Depth Motion Maps and Gabor filter

Two-Stream Dictionary Learning Architecture for Action Recognition

Human Action Recognition Using Two-Stream Attention Based LSTM Networks

Human-to-Human Interaction Detection

Action-Stage Emphasized Spatiotemporal VLAD for Video Action Recognition

Real Time Human Action Recognition in a Long Video Sequence

Multi-Label Action Anticipation for Real-World Videos with Scene Understanding

Learning Latent Spatio-Temporal Compositional Model for Human Action Recognition