Abstract: Open-vocabulary Temporal Action Detection (Open-vocab TAD) extends Closed-vocabulary Temporal Action Detection (Closed-vocab TAD). Closed-vocab TAD localizes and classifies actions only within a predefined set of categories, whereas Open-vocab TAD is not limited to those categories. This matters in real-world scenarios, where the variety of actions in videos is vast and not always predictable. Prevalent Open-vocab TAD methods typically take a two-stage approach: they first generate action proposals and then identify the action in each proposal. However, errors made in the first stage degrade the accuracy of the subsequent action identification. In addition, existing methods struggle with actions of different durations because they rely on fixed temporal processing. We therefore propose a one-stage approach consisting of two primary modules: Multi-scale Video Analysis (MVA) and Video-Text Alignment (VTA). The MVA module captures actions at varying temporal resolutions, which addresses the difficulty of detecting actions with diverse durations. The VTA module leverages the synergy between the visual and textual modalities to align video segments with their corresponding action labels, a critical step for accurate action identification in Open-vocab scenarios. Evaluations on the widely used THUMOS14 and ActivityNet-1.3 datasets show that the proposed method outperforms existing methods in both Open-vocab and Closed-vocab settings, demonstrating its effectiveness for the TAD task.
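The two ideas named in the abstract, processing video features at several temporal resolutions and matching segments to action-label text, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`temporal_pyramid`, `align_to_labels`), the average-pooling scheme, the cosine-similarity matching, and the random stand-in embeddings are all assumptions chosen for illustration. A real system would presumably use features from a pretrained video and text encoder.

```python
import numpy as np


def temporal_pyramid(features, num_levels=3):
    """Build coarser temporal levels by averaging adjacent pairs of time steps.

    features: array of shape (T, D), one D-dim feature per time step.
    Returns a list of arrays; level i has roughly T / 2**i time steps.
    (Hypothetical stand-in for multi-scale temporal processing.)
    """
    levels = [features]
    for _ in range(num_levels - 1):
        prev = levels[-1]
        t = (prev.shape[0] // 2) * 2  # drop a trailing odd time step
        if t < 2:
            break
        pooled = prev[:t].reshape(t // 2, 2, -1).mean(axis=1)
        levels.append(pooled)
    return levels


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def align_to_labels(segment_features, label_embeddings):
    """Assign each segment the label whose embedding is most similar.

    Returns (label_ids, scores), both of shape (num_segments,).
    (Hypothetical stand-in for video-text alignment.)
    """
    sims = l2_normalize(segment_features) @ l2_normalize(label_embeddings).T
    return sims.argmax(axis=1), sims.max(axis=1)


# Random stand-ins: 16 time steps and 4 action labels in an 8-dim shared space.
rng = np.random.default_rng(0)
video_features = rng.normal(size=(16, 8))
label_embeddings = rng.normal(size=(4, 8))

pyramid = temporal_pyramid(video_features, num_levels=3)
results = [align_to_labels(level, label_embeddings) for level in pyramid]
for level, (label_ids, scores) in zip(pyramid, results):
    print(level.shape[0], "segments ->", label_ids.tolist())
```

Because the label set enters only through `label_embeddings`, unseen categories can be scored by embedding their text without retraining, which is the property that makes this kind of matching suitable for open-vocabulary settings.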
