Abstract: Spatio-temporal action detection encompasses the tasks of localizing and classifying individual actions within a video. Recent works aim to enhance this process by incorporating interaction modeling, which captures the relationship between people and their surrounding context. However, these approaches have primarily focused on fully-supervised learning, and the current limitation lies in the lack of generalization capability to recognize unseen action categories. In this paper, we aim to adapt the pretrained image-language models to detect unseen actions. To this end, we propose a method which can effectively leverage the rich knowledge of visual-language models to perform Person-Context Interaction. Meanwhile, our Context Prompting module will utilize contextual information to prompt labels, thereby enhancing the generation of more representative text features. Moreover, to address the challenge of recognizing distinct actions by multiple people at the same timestamp, we design the Interest Token Spotting mechanism which employs pretrained visual knowledge to find each person's interest context tokens, and then these tokens will be used for prompting to generate text features tailored to each individual. To evaluate the ability to detect unseen actions, we propose a comprehensive benchmark on J-HMDB, UCF101-24, and AVA datasets. The experiments show that our method achieves superior results compared to previous approaches and can be further extended to multi-action videos, bringing it closer to real-world applications. The code and data can be found at https://webber2933.github.io/ST-CLIP-project-page.

What problem does this paper attempt to address?

The problem this paper addresses is the limited generalization ability of existing methods in **zero-shot spatio-temporal action detection**. Existing spatio-temporal action detection methods rely mainly on fully-supervised learning, so they can recognize only the action categories seen during training and do not generalize well to unseen categories. To overcome this limitation, the authors propose a method that uses pretrained image-language models (such as CLIP) to detect unseen actions.

### Main problems

1. **Limitations of existing methods**: Current methods focus mainly on fully-supervised learning, which results in poor performance on unseen action categories.
2. **Challenges in multi-action videos**: In real-life scenarios, multiple individuals may perform different actions simultaneously in a video, which places higher demands on existing methods.
3. **Cost of data annotation**: Fully-supervised learning requires a large amount of annotated data, and the annotation process is time-consuming and costly.

### Solutions

To solve these problems, the authors propose the ST-CLIP framework, which includes the following components:

- **Person-Context Interaction**: Uses the visual knowledge of CLIP to model the relationship between people and their surrounding environment, without additional interaction modules.
- **Context Prompting**: A multi-level context-prompting module progressively uses spatio-temporal context information to enrich the text descriptions, making the classification more discriminative.
- **Interest Token Spotting**: An interest-token-spotting mechanism identifies the context tokens most relevant to each individual's action and generates text features tailored to that individual.
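To make the idea of person-specific text features more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the toy three-dimensional vectors, the top-k selection by cosine similarity, and the mean-shift used in place of the learned Context Prompting module are all assumptions made for illustration. It only shows the general pattern of picking the context tokens closest to each person, adjusting the label embeddings with them, and then classifying by similarity.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def spot_interest_tokens(person, context_tokens, k):
    """Indices of the k context tokens most similar to the person feature.

    Assumption: similarity ranking stands in for the paper's spotting step.
    """
    ranked = sorted(
        range(len(context_tokens)),
        key=lambda i: cosine(person, context_tokens[i]),
        reverse=True,
    )
    return ranked[:k]


def prompt_text_features(text_features, context_tokens, indices, weight=0.5):
    """Hypothetical stand-in for Context Prompting.

    Shifts every label embedding by the weighted mean of the selected tokens.
    """
    dim = len(context_tokens[0])
    mean = [
        sum(context_tokens[i][d] for i in indices) / len(indices)
        for d in range(dim)
    ]
    return {
        label: [t + weight * m for t, m in zip(vec, mean)]
        for label, vec in text_features.items()
    }


def classify(person, text_features):
    """Pick the label whose text feature is most similar to the person feature."""
    return max(text_features, key=lambda label: cosine(person, text_features[label]))


# Toy data (made up): two people, four context tokens, two candidate labels.
context_tokens = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.8, 0.0, 0.3], [0.0, 0.2, 1.0]]
text_features = {"run": [1.0, 0.0, 0.0], "sit": [0.0, 0.0, 1.0]}
person_a = [1.0, 0.0, 0.2]
person_b = [0.0, 0.3, 1.0]

tokens_a = spot_interest_tokens(person_a, context_tokens, k=2)
tokens_b = spot_interest_tokens(person_b, context_tokens, k=2)
label_a = classify(person_a, prompt_text_features(text_features, context_tokens, tokens_a))
label_b = classify(person_b, prompt_text_features(text_features, context_tokens, tokens_b))
```

In this toy setup the two people select different context tokens, so each is matched against a different set of prompted label embeddings. That is the property the summary attributes to Interest Token Spotting for videos in which several people act at the same time.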
### Evaluation methods

To evaluate the method, the authors established benchmarks on three popular spatio-temporal action detection datasets (J-HMDB, UCF101-24, and AVA) and conducted extensive experiments. In particular, they evaluated multiple sets of unseen action categories through cross-validation and random selection of different category combinations.

### Experimental results

The experimental results show that ST-CLIP performs well in detecting unseen actions. On the AVA dataset in particular, it can recognize different actions performed by different individuals in the same video. Experiments on J-HMDB and UCF101-24 also show that the method is competitive, demonstrating its potential for practical applications.

In conclusion, the main contribution of this paper is a new framework that detects spatio-temporal actions in the zero-shot setting, addresses the limited generalization of existing methods, and offers new ideas for future research on spatio-temporal action detection.
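The evaluation protocol mentioned above, random selection of different seen and unseen category combinations, can be sketched as follows. This is an illustrative assumption rather than the benchmark's actual procedure: the function name `make_label_splits`, the unseen ratio, the number of splits, the seed, and the placeholder label names are all hypothetical.

```python
import random


def make_label_splits(labels, unseen_ratio, num_splits, seed=0):
    """Draw several random (seen, unseen) partitions of a label set.

    Hypothetical helper: the ratio and split count used in the paper's
    benchmark are not stated in this summary.
    """
    rng = random.Random(seed)  # fixed seed so the splits are reproducible
    num_unseen = max(1, round(len(labels) * unseen_ratio))
    splits = []
    for _ in range(num_splits):
        unseen = set(rng.sample(labels, num_unseen))
        seen = [label for label in labels if label not in unseen]
        splits.append((seen, sorted(unseen)))
    return splits


# Placeholder action names, not the real dataset label list.
labels = [f"action_{i}" for i in range(8)]
splits = make_label_splits(labels, unseen_ratio=0.25, num_splits=3, seed=42)
```

A model would then be trained on the seen labels of each split and tested on the unseen ones, with the scores averaged across splits so that the result does not depend on one particular choice of unseen categories.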

Spatio-Temporal Context Prompting for Zero-Shot Action Detection

Interaction-Aware Prompting for Zero-Shot Spatio-Temporal Action Detection

Context-Guided Super-Class Inference for Zero-Shot Detection

ActionCLIP: Adapting Language-Image Pretrained Models for Video Action Recognition

Multi-modal Prompting for Low-Shot Temporal Action Localization

Spatial–Temporal Context-Aware Online Action Detection and Prediction

Zero-Shot Temporal Action Detection by Learning Multimodal Prompts and Text-Enhanced Actionness

ZSTAD: Zero-Shot Temporal Activity Detection

LSTC: Boosting Atomic Action Detection with Long-Short-Term Context

Human Action Recognition with Contextual Constraints Using a RGB-D Sensor

Leveraging Temporal Contextualization for Video Action Recognition

Spot What Matters: Learning Context Using Graph Convolutional Networks for Weakly-Supervised Action Detection

Zero-Shot Temporal Action Detection via Vision-Language Prompting

Generating Action-conditioned Prompts for Open-vocabulary Video Action Recognition

Separately Guided Context-Aware Network for Weakly Supervised Temporal Action Detection

Spatio-Temporal Action Detection with Multi-Object Interaction

Attentive Action and Context Factorization

ContextDet: Temporal Action Detection with Adaptive Context Aggregation

Online Action Tube Detection Via Resolving The Spatio-Temporal Context Pattern

Action recognition using attention-based spatio-temporal VLAD networks and adaptive video sequences optimization

Exploring Conditional Multi-Modal Prompts for Zero-shot HOI Detection