Test-Time Zero-Shot Temporal Action Localization

Benedetta Liberatori, Alessandro Conti, Paolo Rota, Yiming Wang, Elisa Ricci
2024-04-11
Abstract: Zero-Shot Temporal Action Localization (ZS-TAL) seeks to identify and locate actions in untrimmed videos unseen during training. Existing ZS-TAL methods involve fine-tuning a model on a large amount of annotated training data. While effective, training-based ZS-TAL approaches assume the availability of labeled data for supervised learning, which can be impractical in some applications. Furthermore, the training process naturally induces a domain bias into the learned model, which may adversely affect the model's generalization ability to arbitrary videos. These considerations prompt us to approach the ZS-TAL problem from a radically novel perspective, relaxing the requirement for training data. To this aim, we introduce a novel method that performs Test-Time adaptation for Temporal Action Localization (T3AL). In a nutshell, T3AL adapts a pre-trained Vision and Language Model (VLM). T3AL operates in three steps. First, a video-level pseudo-label of the action category is computed by aggregating information from the entire video. Then, action localization is performed adopting a novel procedure inspired by self-supervised learning. Finally, frame-level textual descriptions extracted with a state-of-the-art captioning model are employed for refining the action region proposals. We validate the effectiveness of T3AL by conducting experiments on the THUMOS14 and the ActivityNet-v1.3 datasets. Our results demonstrate that T3AL significantly outperforms zero-shot baselines based on state-of-the-art VLMs, confirming the benefit of a test-time adaptation approach.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses Zero-Shot Temporal Action Localization (ZS-TAL): identifying and temporally locating actions in untrimmed videos that were not seen during training. Existing ZS-TAL methods fine-tune a model on a large amount of annotated data. This can be impractical when labeled data is unavailable, and the training process introduces a domain bias that may hurt generalization to arbitrary videos. To address these issues, the paper introduces T3AL (Test-Time adaptation for Temporal Action Localization), which adapts a pre-trained Vision and Language Model (VLM) directly on unannotated videos at test time, without requiring a training dataset.

T3AL operates in three steps:

1. **Video-level pseudo-label calculation**: A pre-trained VLM extracts semantic information from each frame. This information is aggregated over the entire video to compute a video-level pseudo-label, which is an estimate of the action category present in the video.
2. **Self-supervised prediction refinement**: Guided by the video-level pseudo-label, a procedure inspired by self-supervised learning performs action localization. Frames are scored by their similarity to the pseudo-label, and these scores are used to further optimize the model parameters.
3. **Text-based region suppression**: A state-of-the-art captioning model generates a textual description for each frame, and these descriptions are used to refine the action region proposals.

The effectiveness of T3AL is validated through experiments on the THUMOS14 and ActivityNet-v1.3 datasets. The results show that T3AL significantly outperforms zero-shot baselines based on state-of-the-art VLMs, even though it uses no annotated training data.
Additionally, the paper reveals the limitations of existing methods through cross-dataset generalization analysis and demonstrates the advantages of the T3AL method. In summary, T3AL aims to address the data dependency and generalization issues in zero-shot temporal action localization by directly adapting the model during the testing phase, thereby enhancing the model's ability to handle unseen data.
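The pseudo-labeling and frame-scoring ideas described above can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes frame and class-name embeddings have already been extracted by some VLM, uses mean pooling as the aggregation (the summary only says information is aggregated), and uses a fixed threshold to form regions. The function names `video_pseudo_label`, `frame_scores`, and `propose_regions` are hypothetical, and both the test-time parameter update and the caption-based refinement are omitted.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products act as cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def video_pseudo_label(frame_feats, class_text_feats):
    """Pick the class whose text embedding best matches the pooled video feature.

    frame_feats: (T, D) array of per-frame embeddings (assumed precomputed).
    class_text_feats: (C, D) array of class-name embeddings (assumed precomputed).
    Mean pooling is an assumption made for this sketch.
    """
    video_feat = l2_normalize(frame_feats.mean(axis=0))
    sims = l2_normalize(class_text_feats) @ video_feat
    return int(np.argmax(sims))


def frame_scores(frame_feats, text_feat):
    """Score each frame by cosine similarity to the pseudo-label embedding,
    then min-max normalize the scores to the range [0, 1]."""
    sims = l2_normalize(frame_feats) @ l2_normalize(text_feat)
    return (sims - sims.min()) / (sims.max() - sims.min() + 1e-8)


def propose_regions(scores, threshold=0.5):
    """Group consecutive frames scoring above the threshold into
    (start, end) index pairs, with both ends inclusive.
    The threshold value is an illustrative assumption."""
    regions = []
    start = None
    for t, s in enumerate(scores):
        if s > threshold and start is None:
            start = t
        elif s <= threshold and start is not None:
            regions.append((start, t - 1))
            start = None
    if start is not None:
        regions.append((start, len(scores) - 1))
    return regions
```

In use, the label returned by `video_pseudo_label` selects one row of the class embeddings, which is then passed to `frame_scores`, and the resulting scores go to `propose_regions`. In the method as summarized, the scores would additionally drive an update of the model parameters, and the proposals would be refined with frame captions.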