Abstract:Spatio-temporal action detection encompasses the tasks of localizing and classifying individual actions within a video. Recent works aim to enhance this process by incorporating interaction modeling, which captures the relationship between people and their surrounding context. However, these approaches have primarily focused on fully-supervised learning, and the current limitation lies in the lack of generalization capability to recognize unseen action categories. In this paper, we aim to adapt the pretrained image-language models to detect unseen actions. To this end, we propose a method which can effectively leverage the rich knowledge of visual-language models to perform Person-Context Interaction. Meanwhile, our Context Prompting module will utilize contextual information to prompt labels, thereby enhancing the generation of more representative text features. Moreover, to address the challenge of recognizing distinct actions by multiple people at the same timestamp, we design the Interest Token Spotting mechanism which employs pretrained visual knowledge to find each person's interest context tokens, and then these tokens will be used for prompting to generate text features tailored to each individual. To evaluate the ability to detect unseen actions, we propose a comprehensive benchmark on J-HMDB, UCF101-24, and AVA datasets. The experiments show that our method achieves superior results compared to previous approaches and can be further extended to multi-action videos, bringing it closer to real-world applications. The code and data can be found in <a class="link-external link-https" href="https://webber2933.github.io/ST-CLIP-project-page" rel="external noopener nofollow">this https URL</a>.
What problem does this paper attempt to address?
The problem that this paper attempts to solve is the problem of insufficient generalization ability in **Zero - Shot Spatio - Temporal Action Detection**. Specifically, the existing spatio - temporal action detection methods mainly rely on fully - supervised learning, which makes them only able to recognize the action categories included in the training phase and cannot be well generalized to unseen action categories. To overcome this limitation, the author proposes a new method, aiming to use pre - trained image - language models (such as CLIP) to detect unseen actions.
### Main problems
1. **Limitations of existing methods**: Current methods mainly focus on fully - supervised learning, resulting in poor performance in recognizing unseen action categories.
2. **Challenges in multi - action videos**: In real - life scenarios, there may be multiple individuals performing different actions simultaneously in a video, which poses higher requirements for existing methods.
3. **Cost of data annotation**: Fully - supervised learning requires a large amount of annotated data, and the annotation process is very time - consuming and costly.
### Solutions
To solve the above problems, the author proposes the ST - CLIP framework, which specifically includes the following aspects:
- **Person - Context Interaction**: Use the visual knowledge of CLIP to model the relationship between people and their surrounding environment without the need for additional interaction modules.
- **Context Prompting**: Through a multi - level context - prompting module, gradually use spatio - temporal context information to enhance text descriptions, thereby improving the discriminative ability of classification.
- **Interest Token Spotting**: Introduce an interest token - spotting mechanism to identify the context tokens most relevant to each individual's action and generate personalized text features.
### Evaluation methods
To evaluate the effectiveness of this method, the author established benchmark tests on three popular spatio - temporal action detection datasets (J - HMDB, UCF101 - 24, and AVA) and conducted extensive experiments. In particular, the author ensured a comprehensive evaluation of multiple unseen action categories through cross - validation and random selection of different category combinations.
### Experimental results
The experimental results show that ST - CLIP performs excellently in detecting unseen actions. Especially on the AVA dataset, it can effectively recognize different actions performed by different individuals in the same video. In addition, experiments on the J - HMDB and UCF101 - 24 datasets also prove the competitiveness of this method, demonstrating its potential in practical applications.
In conclusion, the main contribution of this paper is to propose a new framework that can effectively detect spatio - temporal actions in the zero - sample case, solve the lack of generalization ability of existing methods, and provide new ideas for future spatio - temporal action detection research.