Abstract: The canonical approach to video action recognition requires a neural network model to perform a classic 1-of-N majority vote task. Such models are trained to predict a fixed set of predefined categories, which limits their transferability to new datasets with unseen concepts. In this article, we provide a new perspective on action recognition by attaching importance to the semantic information of label texts rather than simply mapping them to numbers. Specifically, we model this task as a video-text matching problem within a multimodal learning framework, which strengthens the video representation with more semantic language supervision and enables our model to perform zero-shot action recognition without any further labeled data or additional parameters. Moreover, to handle the deficiency of label texts and make use of tremendous web data, we propose a new paradigm based on this multimodal learning framework for action recognition, which we dub "pre-train, adapt and fine-tune." This paradigm first learns powerful representations by pre-training on a large amount of web image-text or video-text data. Then, it makes the action recognition task act more like the pre-training problem via adaptation engineering. Finally, it is fine-tuned end-to-end on target datasets to obtain strong performance. We give an instantiation of the new paradigm, ActionCLIP, which not only has superior and flexible zero-shot/few-shot transfer ability but also reaches top performance on the general action recognition task, achieving 83.8% top-1 accuracy on Kinetics-400 with ViT-B/16 as the backbone. Code is available at https://github.com/sallymmx/ActionCLIP.git.
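The video-text matching idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that per-frame embeddings and label-text embeddings have already been produced by some pre-trained encoders (not shown here), that frames are combined by simple mean pooling, and that the function name `match_video_to_labels` and the `temperature` value are hypothetical choices made for this example only.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def match_video_to_labels(frame_embeddings, label_embeddings, temperature=0.01):
    """Score one video against every label text (hypothetical helper).

    frame_embeddings: array of shape (num_frames, dim), assumed to come
        from a pre-trained visual encoder.
    label_embeddings: array of shape (num_labels, dim), assumed to come
        from a pre-trained text encoder applied to the label texts.
    Returns the index of the best-matching label and a softmax
    distribution over all labels.
    """
    # Assumption: mean pooling over frames stands in for temporal modelling.
    video = l2_normalize(frame_embeddings.mean(axis=0))
    labels = l2_normalize(label_embeddings)

    # Cosine similarity between the video and each label text.
    similarities = labels @ video

    # Temperature-scaled softmax, shifted by the max for numerical stability.
    logits = similarities / temperature
    logits = logits - logits.max()
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.argmax(similarities)), probs


# Toy data: three label embeddings and two frames that lie close to label 1.
label_embeddings = np.eye(3, 4)
frame_embeddings = np.array([[0.1, 0.9, 0.0, 0.1],
                             [0.0, 1.1, 0.1, 0.0]])
predicted, probabilities = match_video_to_labels(frame_embeddings, label_embeddings)
print(predicted, probabilities)
```

Because the label set enters only through `label_embeddings`, unseen categories can be scored by embedding their label texts, with no new classifier weights. This is the property the abstract refers to as zero-shot recognition without further parameters.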

Action Recognition with Uncertain VLAD

Action Recognition Using Hybrid Feature Descriptor and VLAD Video Encoding

ActionCLIP: Adapting Language-Image Pretrained Models for Video Action Recognition

Action Recognition with Stacked Fisher Vectors

Action-Stage Emphasized Spatiotemporal VLAD for Video Action Recognition

A Method of Simultaneously Action Recognition and Video Segmentation of Video Streams

DA-VLAD: Discriminative Action Vector of Locally Aggregated Descriptors for Action Recognition

Action recognition using attention-based spatio-temporal VLAD networks and adaptive video sequences optimization

Realistic Human Action Recognition: when Deep Learning Meets VLAD

VLAD-SSTA: VLAD with Soft Spatio-Temporal Assignment for Action Recognition

MA-VLAD: a fine-grained local feature aggregation scheme for action recognition

Spatio-Temporal Self-Attention Weighted VLAD Neural Network for Action Recognition

Towards Good Practices for Action Video Encoding

Good Practices for Learning to Recognize Actions Using FV and VLAD

View-invariant Human Action Recognition Via Robust Locally Adaptive Multi-View Learning

Temporal Distinct Representation Learning for Action Recognition

Video Action Recognition with Attentive Semantic Units

Combining Sparse And Dense Descriptors With Temporal Semantic Structures For Robust Human Action Recognition

A Novel Trajectory-VLAD Based Action Recognition Algorithm for Video Analysis

ActionVLAD: Learning Spatio-Temporal Aggregation for Action Classification

Action Recognition by Exploring Data Distribution and Feature Correlation