Abstract: Temporal action detection (TAD) involves the localization and classification of action instances within untrimmed videos. While standard TAD follows fully supervised learning in a closed-set setting on large training data, recent zero-shot TAD methods showcase the promising open-set setting by leveraging large-scale contrastive visual-language (ViL) pretrained models. However, existing zero-shot TAD methods are limited in how they construct a strong relationship between the two interdependent tasks of localization and classification, and in how they adapt the ViL model to video understanding. In this work, we present ZEETAD, featuring two modules: dual-localization and zero-shot proposal classification. The former is a Transformer-based module that detects action events while selectively collecting crucial semantic embeddings for later recognition. The latter, a CLIP-based module, generates semantic embeddings from text and frame inputs for each temporal unit. Additionally, we enhance discriminative capability on unseen classes by minimally updating the frozen CLIP encoder with lightweight adapters. Extensive experiments on the THUMOS14 and ActivityNet-1.3 datasets demonstrate our approach's superior performance in zero-shot TAD and effective knowledge transfer from ViL models to unseen action categories.

What problem does this paper attempt to address?

The paper aims to achieve zero-shot, end-to-end temporal action detection (ZS-TAD) on unseen action categories. Specifically, it targets two limitations of existing methods in open-set scenarios: how to build a strong relationship between the localization and classification tasks, and how to adapt large-scale vision-language (ViL) pretrained models to video understanding.

### Background and Problem Definition

**Background**:

- **Temporal Action Detection (TAD)**: Localizing and classifying action instances in untrimmed videos.
- **Standard TAD**: Usually fully supervised, in a closed-set setting with a large amount of training data.
- **Zero-shot TAD**: Uses large-scale contrastive vision-language pretrained models (such as CLIP) and shows promise in the open-set setting.

**Problems**:

- Existing zero-shot TAD methods are limited in how they build a strong relationship between the localization and classification tasks and in how they adapt ViL models to video understanding.
- An effective end-to-end architecture is needed that performs action localization and classification simultaneously in open-set scenarios.

### Solutions

**ZEETAD Model**:

- **Dual-Localization Module**: A Transformer-based module that detects action events and selectively collects key semantic embeddings for subsequent recognition.
- **Zero-Shot Proposal Classification Module**: A CLIP-based module that generates semantic embeddings for each temporal unit from text and frame inputs.
- **Lightweight Adapters**: Improve discrimination on unseen categories by minimally updating the frozen CLIP encoder.

### Technical Details

1. **Dual-Localization Mechanism**:
   - **Objective**: Determine the proposal boundaries and also segment the semantic embeddings synthesized by CLIP.
   - **Implementation**: Use video clip features extracted by a 3D convolutional neural network (CNN) for localization, and the frame embeddings generated by CLIP's image encoder for classification.
2. **Efficient Fine-Tuning Method (Adapters)**:
   - **Purpose**: Adapt large-scale ViL models to the video domain.
   - **Implementation**: Update only the lightweight adapters injected into the frozen CLIP Transformer sub-layers.
3. **End-to-End Model Architecture**:
   - **One-Stage TAD Model**: Contains a learnable dual-localization module and a zero-shot proposal classification module.
   - **Process**:
     - **Frame Embeddings**: Pass the intermediate RGB frames through the CLIP visual encoder to obtain frame embeddings.
     - **Temporal Modeling**: Apply a temporal Transformer to the frame embeddings.
     - **Semantic Representation**: Multiply the frame embeddings with the text embeddings to generate a semantic embedding for each frame.
     - **Dynamic Foreground Mask**: Generate a dynamic foreground mask that selects the semantic embeddings associated with the action boundaries.
     - **Classification**: Aggregate the selected semantic embeddings and pick the category with the highest matching score.

### Experimental Results

**Datasets and Metrics**:

- **Datasets**: THUMOS14 and ActivityNet-1.3.
- **Evaluation Metric**: Mean Average Precision (mAP) at different temporal IoU thresholds.

**Main Results**:

- ZEETAD outperforms existing zero-shot TAD methods on both THUMOS14 and ActivityNet-1.3, with higher mAP at multiple IoU thresholds.
- See Table 1 for specific values.
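
To make the evaluation metric concrete, the following is a minimal sketch of temporal IoU and a simplified, non-interpolated average precision for a single video. It is not the official THUMOS14 or ActivityNet evaluation code: the function names, the greedy matching rule, and the toy segments are all illustrative assumptions.

```python
def temporal_iou(seg_a, seg_b):
    """Temporal IoU of two (start, end) segments, e.g. in seconds."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(predictions, ground_truth, threshold):
    """Simplified AP for one class at one tIoU threshold (assumption:
    non-interpolated AP, each ground-truth segment matched at most once).

    predictions: list of (start, end, score); ground_truth: list of (start, end).
    """
    if not ground_truth:
        return 0.0
    matched = [False] * len(ground_truth)
    hits = []  # 1 = true positive, 0 = false positive, in descending score order
    for start, end, _score in sorted(predictions, key=lambda p: -p[2]):
        best_iou, best_idx = 0.0, -1
        for i, gt in enumerate(ground_truth):
            iou = temporal_iou((start, end), gt)
            if not matched[i] and iou > best_iou:
                best_iou, best_idx = iou, i
        if best_idx >= 0 and best_iou >= threshold:
            matched[best_idx] = True
            hits.append(1)
        else:
            hits.append(0)
    true_positives = 0
    precision_sum = 0.0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            true_positives += 1
            precision_sum += true_positives / rank
    return precision_sum / len(ground_truth)


def map_at_threshold(per_class, threshold):
    """mAP: mean of per-class AP. per_class maps label -> (predictions, ground_truth)."""
    aps = [average_precision(p, g, threshold) for p, g in per_class.values()]
    return sum(aps) / len(aps) if aps else 0.0


if __name__ == "__main__":
    # Hypothetical toy data: two ground-truth segments, three scored predictions.
    gt = [(10, 20), (30, 40)]
    preds = [(11, 19, 0.9), (50, 60, 0.8), (31, 42, 0.7)]
    for thr in (0.3, 0.5, 0.7):  # illustrative thresholds only
        print(thr, round(average_precision(preds, gt, thr), 4))
```

A prediction counts as correct only if it overlaps an unmatched ground-truth segment by at least the threshold, so raising the threshold rewards tighter boundaries. Reported mAP figures typically come from each benchmark's official tooling, which may differ from this sketch in interpolation and matching details.
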
### Summary

By designing the ZEETAD model, the paper addresses key problems in zero-shot temporal action detection, in particular performing action localization and classification simultaneously in open-set scenarios.
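
The classification process described under Technical Details (frame and text embeddings multiplied into per-frame semantic embeddings, masked by the foreground, aggregated, and matched to a category) can be sketched in plain Python. This is only an illustration under stated assumptions, not the authors' implementation: short toy vectors stand in for CLIP embeddings, the temporal Transformer and adapters are omitted, a hard binary mask derived from proposal boundaries stands in for the dynamic foreground mask, and mean pooling is assumed as the aggregation step. All function names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def semantic_embeddings(frame_embs, text_embs):
    """Per-frame class similarities: (T x D) frames times (C x D) texts -> T x C."""
    frames = [l2_normalize(f) for f in frame_embs]
    texts = [l2_normalize(t) for t in text_embs]
    return [[sum(a * b for a, b in zip(f, t)) for t in texts] for f in frames]


def foreground_mask(num_frames, start, end):
    """Binary mask with 1 for frame indices in [start, end) and 0 elsewhere.

    Assumption: a hard mask from proposal boundaries replaces the learned,
    dynamic mask described in the text.
    """
    return [1 if start <= t < end else 0 for t in range(num_frames)]


def classify_proposal(frame_embs, text_embs, start, end):
    """Return (best_class_index, pooled_scores) for one temporal proposal."""
    sims = semantic_embeddings(frame_embs, text_embs)
    mask = foreground_mask(len(sims), start, end)
    selected = [row for row, keep in zip(sims, mask) if keep]
    if not selected:
        raise ValueError("proposal covers no frames")
    num_classes = len(text_embs)
    # Mean-pool the selected per-frame scores (aggregation choice is an assumption).
    pooled = [
        sum(row[c] for row in selected) / len(selected) for c in range(num_classes)
    ]
    best = max(range(num_classes), key=lambda c: pooled[c])
    return best, pooled


if __name__ == "__main__":
    # Toy 2-D "embeddings": frames 2 and 3 resemble class 1, the rest class 0.
    frames = [[1, 0], [1, 0], [0, 1], [0, 1], [1, 0]]
    texts = [[1, 0], [0, 1]]  # one text embedding per candidate class
    print(classify_proposal(frames, texts, start=2, end=4))
```

Because the category set enters only through `text_embs`, swapping in embeddings for new class names changes the label space without retraining, which is the property that makes the open-set setting possible.
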

ZEETAD: Adapting Pretrained Vision-Language Model for Zero-Shot End-to-End Temporal Action Detection

Zero-Shot Temporal Action Detection via Vision-Language Prompting

ActionCLIP: Adapting Language-Image Pretrained Models for Video Action Recognition.

Zero-Shot Temporal Action Detection by Learning Multimodal Prompts and Text-Enhanced Actionness

DeTAL: Open-Vocabulary Temporal Action Localization with Decoupled Networks

ZSTAD: Zero-Shot Temporal Activity Detection

Transformer-Based Approach Via Contrastive Learning for Zero-Shot Detection.

TN-ZSTAD: Transferable Network for Zero-Shot Temporal Activity Detection.

End-to-End Temporal Action Detection with Transformer.

Test-Time Zero-Shot Temporal Action Localization

Adapting Short-Term Transformers for Action Detection in Untrimmed Videos

One-Stage Open-Vocabulary Temporal Action Detection Leveraging Temporal Multi-scale and Action Label Features

Multi-Modal Few-Shot Temporal Action Detection Via Vision-Language Meta-Adaptation

MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge

Action Recognition Via Fine-Tuned CLIP Model and Temporal Transformer.

Temporal–Semantic Aligning and Reasoning Transformer for Audio-Visual Zero-Shot Learning

Multi-Modal Few-Shot Temporal Action Detection

AV-TAD: Audio-Visual Temporal Action Detection with Transformer

Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning

Transductive Zero-Shot Action Recognition by Word-Vector Embedding

An Empirical Study of End-to-End Temporal Action Detection