Abstract: Open-vocabulary Temporal Action Detection (Open-vocab TAD) is a video analysis approach that extends the capabilities of Closed-vocabulary Temporal Action Detection (Closed-vocab TAD). Closed-vocab TAD is confined to localizing and classifying actions from a predefined set of categories, whereas Open-vocab TAD is not limited to these categories. This is particularly useful in real-world scenarios, where the variety of actions in videos can be vast and not always predictable. Prevalent Open-vocab TAD methods typically employ a two-stage approach: they first generate action proposals and then identify the actions in those proposals. However, errors made in the first stage can degrade the accuracy of the subsequent action identification. In addition, existing methods struggle with actions of different durations because they rely on fixed temporal processing. We therefore propose a one-stage approach consisting of two primary modules: Multi-scale Video Analysis (MVA) and Video-Text Alignment (VTA). The MVA module captures actions at varying temporal resolutions, addressing the challenge of detecting actions with diverse durations. The VTA module leverages the synergy between the visual and textual modalities to align video segments with their corresponding action labels, a critical step for accurate action identification in Open-vocab scenarios. Evaluations on the widely used THUMOS14 and ActivityNet-1.3 datasets show that the proposed method achieves superior results compared with other methods in both Open-vocab and Closed-vocab settings, demonstrating its effectiveness for the TAD task.
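To make the two ideas concrete, the following is a minimal sketch, not the paper's implementation. It assumes that "multi-scale" can be illustrated by repeatedly average-pooling frame features with stride 2, and that "video-text alignment" can be illustrated by cosine similarity between a time step's feature and a label embedding. The function names, the toy 2-D features, and the label embeddings are all hypothetical; in practice the embeddings would presumably come from pretrained video and text encoders.

```python
import math

def temporal_pyramid(features, num_levels=3):
    """Build a temporal pyramid by averaging adjacent frame pairs (stride 2).

    Assumption: this stands in for a multi-scale module; a trailing odd
    frame at any level is simply dropped.
    """
    levels = [features]
    for _ in range(num_levels - 1):
        prev = levels[-1]
        if len(prev) < 2:
            break
        pooled = [
            [(a + b) / 2.0 for a, b in zip(prev[i], prev[i + 1])]
            for i in range(0, len(prev) - 1, 2)
        ]
        levels.append(pooled)
    return levels

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def classify_timesteps(level, label_embeddings):
    """For each time step, pick the label whose embedding is most similar."""
    results = []
    for feat in level:
        scores = {name: cosine(feat, emb) for name, emb in label_embeddings.items()}
        best = max(scores, key=scores.get)
        results.append((best, scores[best]))
    return results

# Toy data (hypothetical): 8 frames of 2-D features and two label embeddings.
video = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
labels = {"run": [1.0, 0.0], "jump": [0.0, 1.0]}

levels = temporal_pyramid(video, num_levels=3)
for depth, level in enumerate(levels):
    print(depth, len(level), [name for name, _ in classify_timesteps(level, labels)])
```

Because the label set is just a dictionary of embeddings, adding an unseen category in this sketch only requires adding another entry, with no change to the classification code. Coarser pyramid levels summarize longer time spans, which is the intuition behind using several temporal resolutions for actions of different durations.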

One-Stage Open-Vocabulary Temporal Action Detection Leveraging Temporal Multi-scale and Action Label Features
