Text-Enhanced Zero-Shot Action Recognition: A training-free approach

Massimo Bosetti, Shibingfeng Zhang, Bendetta Liberatori, Giacomo Zara, Elisa Ricci, Paolo Rota
2024-08-29
Abstract: Vision-language models (VLMs) have demonstrated remarkable performance across various visual tasks, leveraging joint learning of visual and textual representations. While these models excel in zero-shot image tasks, their application to zero-shot video action recognition (ZS-VAR) remains challenging due to the dynamic and temporal nature of actions. Existing methods for ZS-VAR typically require extensive training on specific datasets, which can be resource-intensive and may introduce domain biases. In this work, we propose Text-Enhanced Action Recognition (TEAR), a simple approach to ZS-VAR that is training-free and does not require the availability of training data or extensive computational resources. Drawing inspiration from recent findings in vision and language literature, we utilize action descriptors for decomposition and contextual information to enhance zero-shot action recognition. Through experiments on UCF101, HMDB51, and Kinetics-600 datasets, we showcase the effectiveness and applicability of our proposed approach in addressing the challenges of ZS-VAR.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the challenges of zero-shot video action recognition (ZS-VAR). Existing methods struggle with the dynamic and temporal characteristics of video and usually require large amounts of training data and computing resources. This is time-consuming and may also introduce domain bias, which limits the model's ability to generalize. To address these problems, the authors propose **Text-Enhanced Action Recognition (TEAR)**, a training-free method that performs zero-shot video action recognition by leveraging text-enhancement techniques.

The main contributions of TEAR are:

1. **Proposes, for the first time, a training-free zero-shot video action recognition method**: the method requires neither training data nor large amounts of computing resources, making ZS-VAR more suitable for practical applications.
2. **Enhances the understanding and recognition of actions in videos by decomposing action labels and providing visually relevant descriptions**: the authors show how decomposing actions and supplying descriptions and context information improves zero-shot action recognition.
3. **Conducts experiments on three standard datasets, namely UCF101, HMDB51, and Kinetics-600**: the results show that TEAR's performance on these datasets is competitive and even outperforms training-based methods.

### Method Overview

The core idea of TEAR is to use pre-trained vision-language models (VLMs) and large language models (LLMs) to generate text descriptors of actions, and then to make zero-shot predictions from these descriptors. The specific steps are as follows:

1. **Generate action descriptors**:
   - Use an LLM to generate multiple descriptors for each action, including category names, decomposed action steps, detailed semantic descriptions, context information, and combinations of these descriptors.
   - For example, for the action "snowboarding", the generated descriptors may include:
     - Category: "snowboarding"
     - Decomposition: "Fasten feet to the snowboard", "Lean forward to start sliding", "Use heel-to-toe weight transfer to turn and maintain balance"
     - Description: "A person stands on a single board and slides down the snow slope, while making turns and jumps and maintaining balance."
     - Context: "Snow-covered hillside or ski resort", "Snowboard", "Snow boots", "Helmet"
2. **Zero-shot recognition**:
   - For the test video, uniformly sample N frames and extract the feature representation of each frame with a visual encoder.
   - Compute the text representation of each action class, match it against the average visual representation of the video by similarity, and select the most similar action as the prediction.

In this way, TEAR can recognize actions in videos without additional training.

### Experimental Results

The authors conducted experiments on multiple datasets to verify the effectiveness of TEAR. The results show that TEAR improves Top-1 accuracy by +6.3% on UCF101 and +12.8% on HMDB51, and also yields an improvement on Kinetics-600.

### Limitations

Although TEAR performs well, the authors also point out its limitations:

- Actions that are fine-grained in time or very atomic may not decompose effectively.
- For actions with weak object associations or large variations in context, the generated text descriptors may not be accurate enough.

Overall, TEAR provides an innovative and efficient method for zero-shot video action recognition and has broad application prospects.
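
### Illustrative Sketch of the Recognition Step

The zero-shot recognition step from the Method Overview (uniform frame sampling, averaging frame features, and matching against per-class text representations) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: synthetic vectors stand in for the outputs of the visual and text encoders, the helper names `uniform_frame_indices` and `predict_action` are hypothetical, and both the L2 normalization and the averaging of descriptor embeddings into one class representation are assumptions, since the summary does not specify how the text representation is aggregated.

```python
import numpy as np


def l2_normalize(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Scale vectors to unit length along `axis`."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def uniform_frame_indices(num_frames: int, n: int) -> np.ndarray:
    """Pick n indices spread evenly over a video of num_frames frames."""
    return np.linspace(0, num_frames - 1, num=n).round().astype(int)


def predict_action(frame_features: np.ndarray, class_descriptor_features: dict) -> str:
    """Return the class whose text representation is closest to the video.

    frame_features: (N, D) embeddings of the sampled frames.
    class_descriptor_features: class name -> (K, D) descriptor embeddings.
    """
    # Average the normalized frame embeddings into one video representation.
    video = l2_normalize(l2_normalize(frame_features).mean(axis=0))
    scores = {}
    for name, descriptors in class_descriptor_features.items():
        # Assumption: a class is represented by the mean of its descriptors.
        text = l2_normalize(l2_normalize(descriptors).mean(axis=0))
        scores[name] = float(video @ text)  # cosine similarity
    return max(scores, key=scores.get)


# Toy data: each class gets its own direction in embedding space, and the
# "video" is built from noisy copies of the snowboarding direction.
rng = np.random.default_rng(0)
dim = 8
directions = {"snowboarding": np.eye(dim)[0], "surfing": np.eye(dim)[1]}
class_descriptor_features = {
    name: direction + 0.05 * rng.normal(size=(4, dim))
    for name, direction in directions.items()
}
all_frames = directions["snowboarding"] + 0.05 * rng.normal(size=(120, dim))

idx = uniform_frame_indices(len(all_frames), n=8)
prediction = predict_action(all_frames[idx], class_descriptor_features)
print(prediction)
```

In a real pipeline, the frame and descriptor embeddings would come from the image and text encoders of a pre-trained vision-language model, and the descriptors themselves would be generated by an LLM as described in step 1.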