Abstract: Zero-Shot Temporal Action Localization (ZS-TAL) seeks to identify and locate actions in untrimmed videos unseen during training. Existing ZS-TAL methods involve fine-tuning a model on a large amount of annotated training data. While effective, training-based ZS-TAL approaches assume the availability of labeled data for supervised learning, which can be impractical in some applications. Furthermore, the training process naturally induces a domain bias into the learned model, which may adversely affect the model's generalization ability to arbitrary videos. These considerations prompt us to approach the ZS-TAL problem from a radically novel perspective, relaxing the requirement for training data. To this aim, we introduce a novel method that performs Test-Time adaptation for Temporal Action Localization (T3AL). In a nutshell, T3AL adapts a pre-trained Vision and Language Model (VLM). T3AL operates in three steps. First, a video-level pseudo-label of the action category is computed by aggregating information from the entire video. Then, action localization is performed adopting a novel procedure inspired by self-supervised learning. Finally, frame-level textual descriptions extracted with a state-of-the-art captioning model are employed for refining the action region proposals. We validate the effectiveness of T3AL by conducting experiments on the THUMOS14 and the ActivityNet-v1.3 datasets. Our results demonstrate that T3AL significantly outperforms zero-shot baselines based on state-of-the-art VLMs, confirming the benefit of a test-time adaptation approach.

What problem does this paper attempt to address?

The paper addresses Zero-Shot Temporal Action Localization (ZS-TAL): identifying and locating actions in untrimmed videos unseen during training. Existing ZS-TAL methods fine-tune a model on a large amount of annotated data. This can be impractical when labeled data is unavailable, and the training process induces a domain bias that can hurt generalization to arbitrary videos.

To address these issues, the paper introduces T3AL (Test-Time adaptation for Temporal Action Localization). T3AL adapts a pre-trained Vision and Language Model (VLM) on unannotated videos at test time, without requiring a training dataset. It operates in three steps:

1. **Video-level pseudo-label calculation**: A pre-trained VLM extracts semantic information from each frame, and this information is aggregated over the entire video to compute a video-level pseudo-label, which is an estimate of the action category in the video.
2. **Self-supervised prediction refinement**: Guided by the video-level pseudo-label, a procedure inspired by self-supervised learning localizes the action. Each frame is scored by its similarity to the pseudo-label, and these scores are used to further optimize the model parameters.
3. **Text-based region suppression**: A state-of-the-art captioning model generates a textual description for each frame, and these descriptions are used to refine the action region proposals.

T3AL is validated through experiments on the THUMOS14 and ActivityNet-v1.3 datasets. The results show that T3AL significantly outperforms zero-shot baselines built on state-of-the-art VLMs, without using annotated training data.

Additionally, a cross-dataset generalization analysis reveals the limitations of existing methods and demonstrates the advantages of T3AL. In summary, T3AL aims to address the data dependency and generalization issues in zero-shot temporal action localization by adapting the model directly at test time, thereby improving its ability to handle unseen data.
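To make the first two steps more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that frame and class-name embeddings have already been produced by some VLM, that "aggregating" means averaging frame embeddings, and that proposals come from a fixed score threshold. The function names and the threshold are hypothetical, and the test-time parameter update and the caption-based refinement are omitted because the summary does not specify how they work.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def video_pseudo_label(frame_feats, class_feats):
    """Step 1 (assumed form): average the frame embeddings and pick the
    class whose text embedding is most similar to that average."""
    dim = len(frame_feats[0])
    mean = [sum(f[i] for f in frame_feats) / len(frame_feats) for i in range(dim)]
    return max(class_feats, key=lambda name: cosine(mean, class_feats[name]))


def frame_scores(frame_feats, text_feat):
    """Step 2 (scoring only): similarity of each frame to the pseudo-label."""
    return [cosine(f, text_feat) for f in frame_feats]


def proposals_from_scores(scores, threshold):
    """Group consecutive frames scoring above the threshold into
    (start, end) proposals with inclusive frame indices."""
    proposals, start = [], None
    for i, score in enumerate(scores):
        if score > threshold:
            if start is None:
                start = i
        elif start is not None:
            proposals.append((start, i - 1))
            start = None
    if start is not None:
        proposals.append((start, len(scores) - 1))
    return proposals


# Toy 2-D embeddings standing in for real VLM features (illustrative only).
frames = [[0.1, 1.0], [1.0, 0.1], [1.0, 0.0], [0.2, 1.0], [1.0, 0.2]]
classes = {"run": [1.0, 0.0], "jump": [0.0, 1.0]}

label = video_pseudo_label(frames, classes)
scores = frame_scores(frames, classes[label])
print(label)                                # run
print(proposals_from_scores(scores, 0.5))   # [(1, 2), (4, 4)]
```

In this toy input, frames 1, 2, and 4 align with the "run" embedding, so the sketch returns two proposals. A real pipeline would replace the fixed threshold with whatever scoring and refinement the method actually uses.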

Test-Time Zero-Shot Temporal Action Localization

DeTAL: Open-Vocabulary Temporal Action Localization with Decoupled Networks

Action Sensitivity Learning for Temporal Action Localization

Zero-Shot Temporal Action Detection via Vision-Language Prompting

Active learning with effective scoring functions for semi-supervised temporal action localization

ZEETAD: Adapting Pretrained Vision-Language Model for Zero-Shot End-to-End Temporal Action Detection

The Solution for Temporal Action Localisation Task of Perception Test Challenge 2024

STAT: Towards Generalizable Temporal Action Localization

Bottom-Up Temporal Action Localization with Mutual Regularization

Zero-shot Action Localization via the Confidence of Large Vision-Language Models

Adaptive Two-Stream Consensus Network for Weakly-Supervised Temporal Action Localization

Unsupervised Pre-training for Temporal Action Localization Tasks

Boosting Semi-Supervised Temporal Action Localization by Learning from Non-Target Classes

Learnable Feature Augmentation Framework for Temporal Action Localization

Prior-Enhanced Temporal Action Localization Using Subject-Aware Spatial Attention

A Novel Action Saliency and Context-Aware Network for Weakly-Supervised Temporal Action Localization

Advancing Temporal Action Localization with a Boundary Awareness Network

ZSTAD: Zero-Shot Temporal Activity Detection

Weakly Supervised Temporal Action Localization Through Contrastive Learning

Multi‐scale feature learning and temporal probing strategy for one‐stage temporal action localization