Abstract: The canonical approach to video action recognition has a neural model perform a classic, standard 1-of-N majority vote task. Such models are trained to predict a fixed set of predefined categories, which limits their ability to transfer to new datasets with unseen concepts. In this paper, we provide a new perspective on action recognition by attaching importance to the semantic information of label texts rather than simply mapping them into numbers. Specifically, we model this task as a video-text matching problem within a multimodal learning framework, which strengthens the video representation with more semantic language supervision and enables our model to do zero-shot action recognition without any further labeled data or additional parameters. Moreover, to handle the deficiency of label texts and make use of the tremendous amount of web data, we propose a new paradigm based on this multimodal learning framework for action recognition, which we dub "pre-train, prompt and fine-tune". This paradigm first learns powerful representations by pre-training on a large amount of web image-text or video-text data. Then it makes the action recognition task act more like the pre-training problem via prompt engineering. Finally, it fine-tunes end-to-end on target datasets to obtain strong performance. We give an instantiation of the new paradigm, ActionCLIP, which not only has superior and flexible zero-shot/few-shot transfer ability but also reaches top performance on the general action recognition task, achieving 83.8% top-1 accuracy on Kinetics-400 with a ViT-B/16 backbone. Code is available at https://github.com/sallymmx/ActionCLIP.git

### What problems does this paper attempt to solve?

This paper aims to address several key challenges in the field of video action recognition:

1. **Limitations of traditional methods**: Traditional video action recognition methods mainly rely on single-modality frameworks, that is, they use only video data for classification. These methods map labels to numbers or one-hot vectors and ignore the semantic information of the label texts. This limits the generalization ability of the model on new datasets, especially when facing unseen concepts.
2. **Zero-shot and few-shot learning**: Existing methods perform poorly on zero-shot and few-shot tasks because they require additional annotated data for fine-tuning. The authors aim to enhance the generalization ability of the model by introducing natural language supervision, enabling it to make effective predictions without additional annotated data.
3. **Utilizing large-scale web data**: The labels of existing fully supervised video action recognition datasets are usually too short to construct rich sentences for language learning. Meanwhile, collecting and annotating new video datasets requires huge storage resources and labor costs. However, a large number of videos with rich but noisy text labels are generated on the web every day. How to utilize this large-scale web data effectively is an open problem.
4. **A new paradigm of pre-training, prompting and fine-tuning**: To overcome the above challenges, the authors propose a new paradigm: "pre-train, prompt, and fine-tune". By designing appropriate prompts, this paradigm can directly reuse models pre-trained on large-scale web data, thereby significantly reducing the pre-training cost and improving the performance of the model on specific datasets.

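The prompting step and zero-shot matching described above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the two prompt templates, the function names, and especially `toy_encode_text`, which is only a stand-in for a pre-trained text encoder and is not how ActionCLIP encodes text. The sketch shows the idea only: label names are expanded into sentences, and the video feature is matched to the label whose sentences are most similar.

```python
import math

# Hypothetical prompt templates; the actual prompts are not given in this summary.
TEMPLATES = ["a video of a person {}", "human action of {}"]


def build_prompts(labels):
    """Expand each label name into one sentence per template."""
    return {label: [t.format(label) for t in TEMPLATES] for label in labels}


def cosine(v, w):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    return dot / (norm_v * norm_w)


def zero_shot_predict(video_feature, labels, encode_text):
    """Return the label whose prompts match the video feature best on average."""
    scores = {}
    for label, sentences in build_prompts(labels).items():
        sims = [cosine(video_feature, encode_text(s)) for s in sentences]
        scores[label] = sum(sims) / len(sims)
    return max(scores, key=scores.get), scores


def toy_encode_text(sentence):
    """Stand-in for a pre-trained text encoder (assumption): keyword counts."""
    vocab = ["running", "swimming", "cooking"]
    # The offset of 1.0 keeps every vector non-zero so cosine() is defined.
    return [1.0 + 5.0 * sentence.count(word) for word in vocab]


# A made-up video feature that points toward the "running" direction.
label, scores = zero_shot_predict(
    [6.0, 1.0, 1.0], ["running", "swimming", "cooking"], toy_encode_text
)
print(label)  # prints: running
```

Because the class set is just a list of strings passed at prediction time, unseen categories can be added without new parameters, which is the property the summary attributes to the video-text matching formulation.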
### Main contributions of the paper

- **Multi-modality learning framework**: Model the video action recognition task as a multi-modality learning problem instead of the traditional single-modality classification task. Introducing semantic language supervision enhances the representation ability and improves the generalization ability of the model in zero-shot / few-shot situations.
- **New paradigm**: Propose a new paradigm of "pre-training, prompting and fine-tuning", which can directly reuse powerful models pre-trained on large-scale web data and so avoids high pre-training costs.
- **Experimental verification**: Extensive experiments verify the effectiveness of the method; it is reported to outperform existing methods on multiple public benchmark datasets, reaching 83.8% top-1 accuracy on Kinetics-400 with a ViT-B/16 backbone.

### Key formulas

Some of the formulas involved in the paper are as follows:

1. **Similarity function**:

   \[ s(x, y)=\frac{v \cdot w^{\top}}{\|v\|\|w\|} \]

   \[ s(y, x)=\frac{w \cdot v^{\top}}{\|w\|\|v\|} \]

   where \(v = g_V(x)\) and \(w = g_W(y)\) are the encoded features of the video and the label respectively.

2. **Normalized similarity score**:

   \[ p_{x \to y}^{i}(x)=\frac{\exp(s(x, y_{i}) / \tau)}{\sum_{j = 1}^{N}\exp(s(x, y_{j}) / \tau)} \]

   \[ p_{y \to x}^{i}(y)=\frac{\exp(s(y, x_{i}) / \tau)}{\sum_{j = 1}^{N}\exp(s(y, x_{j}) / \tau)} \]

   where \(\tau\) is a learnable temperature parameter, and \(N\) is the number of training pairs.

3. **Contrastive loss function**:

   \[ L=\frac{1}{2}\mathbb{E}_{(x, y)\sim D}\left[KL(p_{x \to y}(x), q_{x \to y}(x))+KL(p_{y \to x}(y), q_{y \to x}(y))\right] \]
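
The three formulas above can be tied together in a small pure-Python sketch over one batch of \(N\) video-label pairs. Several choices here are assumptions, not taken from the summary: \(q\) is not defined above, so it is treated as a normalized target distribution that puts equal mass on every pair sharing the same label; the KL term is computed as \(\sum_i q_i \log(q_i / p_i)\) so that it stays finite for hard targets; and \(\tau\) is fixed at an illustrative 0.07 instead of being learned. All function names are hypothetical.

```python
import math


def cosine(v, w):
    """s(x, y): cosine similarity between a video feature v and a text feature w."""
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w)))


def softmax(scores, tau):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [s / tau for s in scores]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_to_target(p, q):
    """sum_i q_i * log(q_i / p_i); terms with q_i = 0 contribute nothing (assumed convention)."""
    return sum(qi * math.log(qi / max(pi, 1e-12)) for pi, qi in zip(p, q) if qi > 0)


def targets(labels):
    """Assumed targets q: uniform over all pairs in the batch with the same label."""
    rows = []
    for li in labels:
        row = [1.0 if li == lj else 0.0 for lj in labels]
        rows.append([r / sum(row) for r in row])
    return rows


def symmetric_loss(video_feats, text_feats, labels, tau=0.07):
    """Batch average of 0.5 * (video-to-text KL + text-to-video KL)."""
    n = len(labels)
    sim = [[cosine(v, w) for w in text_feats] for v in video_feats]
    q = targets(labels)  # symmetric, so row i also serves as column i
    loss = 0.0
    for i in range(n):
        p_x2y = softmax(sim[i], tau)                          # video i vs. all texts
        p_y2x = softmax([sim[j][i] for j in range(n)], tau)   # text i vs. all videos
        loss += 0.5 * (kl_to_target(p_x2y, q[i]) + kl_to_target(p_y2x, q[i]))
    return loss / n


aligned = symmetric_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]], ["a", "b"])
swapped = symmetric_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]], ["a", "b"])
print(aligned < swapped)  # prints: True
```

Soft targets are used instead of a one-hot index so that a batch containing several videos of the same action does not treat them as negatives of each other; whether this matches the paper's exact target construction is not stated in the text above.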

ActionCLIP: A New Paradigm for Video Action Recognition
