Abstract: Vision-language models (VLMs) have demonstrated remarkable performance across various visual tasks, leveraging joint learning of visual and textual representations. While these models excel in zero-shot image tasks, their application to zero-shot video action recognition (ZS-VAR) remains challenging due to the dynamic and temporal nature of actions. Existing methods for ZS-VAR typically require extensive training on specific datasets, which can be resource-intensive and may introduce domain biases. In this work, we propose Text-Enhanced Action Recognition (TEAR), a simple approach to ZS-VAR that is training-free and does not require the availability of training data or extensive computational resources. Drawing inspiration from recent findings in vision and language literature, we utilize action descriptors for decomposition and contextual information to enhance zero-shot action recognition. Through experiments on UCF101, HMDB51, and Kinetics-600 datasets, we showcase the effectiveness and applicability of our proposed approach in addressing the challenges of ZS-VAR.

What problem does this paper attempt to address?

This paper addresses the challenges of zero-shot video action recognition (ZS-VAR). Existing methods struggle with the dynamic and temporal characteristics of video, and they usually require large amounts of training data and computing resources. Training is time-consuming and may also introduce domain bias, which limits how well the model generalizes. To address these problems, the authors propose **Text-Enhanced Action Recognition (TEAR)**, a training-free method that performs zero-shot video action recognition by enriching the text side of a vision-language model.

The main contributions of TEAR are:

1. **A training-free zero-shot video action recognition method, proposed for the first time.** The method requires neither training data nor large computing resources, which makes ZS-VAR more practical to deploy.
2. **Better recognition of actions in videos by decomposing action labels and providing visually relevant descriptions.** The authors show how action decomposition, descriptions, and context information improve zero-shot action recognition.
3. **Experiments on three standard datasets: UCF101, HMDB51, and Kinetics-600.** The results show that TEAR is competitive with, and even outperforms, training-based methods on these datasets.

### Method Overview

The core idea of TEAR is to use a pre-trained vision-language model (VLM) and a large language model (LLM) to generate text descriptors of actions, and then to make zero-shot predictions from these descriptors. The steps are as follows:

1. **Generate action descriptors**:
   - Use the LLM to generate multiple descriptors for each action, including the category name, decomposed action steps, a detailed semantic description, context information, and combinations of these descriptors.
   - For example, for the action "snowboarding", the generated descriptors may include:
     - Category: "snowboarding"
     - Decomposition: "Fasten feet to the snowboard", "Lean forward to start sliding", "Use heel-to-toe weight transfer to turn and maintain balance"
     - Description: "A person stands on a single board and slides down the snow slope, while making turns and jumps and maintaining balance."
     - Context: "Snow-covered hillside or ski resort", "Snowboard", "Snow boots", "Helmet"
2. **Zero-shot recognition**:
   - For the test video, uniformly sample N frames and extract a feature representation of each frame with the visual encoder.
   - Compute the text representation of each action class, match it against the average visual representation of the video by similarity, and select the most similar action as the prediction.

In this way, TEAR can recognize actions in videos without additional training.

### Experimental Results

The authors ran experiments on multiple datasets to verify the effectiveness of TEAR. The results show that TEAR improves Top-1 accuracy by +6.3% on UCF101 and +12.8% on HMDB51, and also improves results on Kinetics-600.

### Limitations

Although TEAR performs well, the authors point out its limitations:

- Actions that are temporally fine-grained or very atomic may not decompose effectively.
- For actions with weak object associations or large variations in context, the generated text descriptors may not be accurate enough.

Overall, TEAR provides an efficient, training-free method for zero-shot video action recognition.
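The zero-shot recognition step from the Method Overview can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names are hypothetical, the frame and descriptor features are assumed to come from some pre-trained VLM's visual and text encoders (here replaced by small hand-written vectors), and averaging the descriptor embeddings per class and scoring with cosine similarity are assumptions, since the summary does not say how descriptors are combined or which similarity is used.

```python
import numpy as np


def uniform_frame_indices(num_frames: int, n: int) -> np.ndarray:
    """Pick n frame indices spread evenly over a video of num_frames frames.

    Takes the centre of each of n equal-length segments (one common way to
    sample uniformly; the exact scheme used by TEAR is an assumption).
    """
    return ((np.arange(n) + 0.5) * num_frames / n).astype(int)


def l2_normalize(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def mean_embedding(features: np.ndarray) -> np.ndarray:
    """Average a (k, d) stack of embeddings into a single unit vector."""
    return l2_normalize(l2_normalize(features).mean(axis=0))


def predict_action(frame_features: np.ndarray,
                   descriptor_features: dict) -> tuple:
    """Return the best-matching class name and the per-class scores.

    frame_features:      (N, d) visual features of the sampled frames.
    descriptor_features: class name -> (D_c, d) text features of that
                         class's descriptors (category, decomposition,
                         description, context).
    """
    video = mean_embedding(frame_features)              # average visual representation
    names = list(descriptor_features)
    text = np.stack([mean_embedding(descriptor_features[c]) for c in names])
    scores = text @ video                               # cosine similarity per class
    return names[int(np.argmax(scores))], scores


# Toy usage with 2-dimensional stand-in features (real encoders would
# produce much higher-dimensional vectors).
frames = np.array([[1.0, 0.0], [0.8, 0.2]])
descriptors = {
    "snowboarding": np.array([[1.0, 0.0], [0.9, 0.1]]),
    "surfing": np.array([[0.0, 1.0]]),
}
label, scores = predict_action(frames, descriptors)
print(label, scores)
```

Because both sides are reduced to one unit vector each, inference costs one matrix-vector product per video, which is consistent with the training-free, low-compute framing of the method.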

Text-Enhanced Zero-Shot Action Recognition: A training-free approach
