ZEETAD: Adapting Pretrained Vision-Language Model for Zero-Shot End-to-End Temporal Action Detection

Thinh Phan, Khoa Vo, Duy Le, Gianfranco Doretto, Donald Adjeroh, Ngan Le
2023-11-05
Abstract: Temporal action detection (TAD) involves the localization and classification of action instances within untrimmed videos. While standard TAD follows fully supervised learning with closed-set setting on large training data, recent zero-shot TAD methods showcase the promising open-set setting by leveraging large-scale contrastive visual-language (ViL) pretrained models. However, existing zero-shot TAD methods have limitations on how to properly construct the strong relationship between two interdependent tasks of localization and classification and adapt ViL model to video understanding. In this work, we present ZEETAD, featuring two modules: dual-localization and zero-shot proposal classification. The former is a Transformer-based module that detects action events while selectively collecting crucial semantic embeddings for later recognition. The latter one, CLIP-based module, generates semantic embeddings from text and frame inputs for each temporal unit. Additionally, we enhance discriminative capability on unseen classes by minimally updating the frozen CLIP encoder with lightweight adapters. Extensive experiments on THUMOS14 and ActivityNet-1.3 datasets demonstrate our approach's superior performance in zero-shot TAD and effective knowledge transfer from ViL models to unseen action categories.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses zero-shot end-to-end temporal action detection (ZS TAD), that is, detecting actions from categories that were not seen during training. Specifically, it aims to overcome the limitations of existing methods in open-set scenarios, in particular how to build a strong relationship between the localization and classification tasks and how to adapt large-scale vision-language (ViL) pretrained models to video understanding.

### Background and Problem Definition of the Paper

**Background**:

- **Temporal Action Detection (TAD)**: Localizes and classifies action instances in untrimmed videos.
- **Standard TAD**: Usually relies on fully supervised learning in a closed-set setting with a large amount of training data.
- **Zero-shot TAD**: Uses large-scale contrastive vision-language pretrained models (such as CLIP) and shows promise in an open-set setting.

**Problems**:

- Existing zero-shot TAD methods are limited in how they build a strong relationship between the localization and classification tasks and in how they adapt ViL models to video understanding.
- An effective end-to-end architecture is needed that performs action localization and classification simultaneously in open-set scenarios.

### Solutions

**ZEETAD Model**:

- **Dual-Localization Module**: A Transformer-based module that detects action events and selectively collects key semantic embeddings for subsequent recognition.
- **Zero-Shot Proposal Classification Module**: A CLIP-based module that generates semantic embeddings for each temporal unit from text and frame inputs.
- **Lightweight Adapters**: Improve discrimination on unseen categories by minimally updating the frozen CLIP encoder.

### Technical Details

1. **Dual-Localization Mechanism**:
   - **Objective**: Determine the proposal boundaries and also segment the semantic embeddings synthesized by CLIP.
   - **Implementation**: Use video clip features extracted by a 3D convolutional neural network (CNN) for localization, and the frame embeddings generated by CLIP's image encoder for classification.
2. **Efficient Fine-Tuning Method (Adapters)**:
   - **Purpose**: Adapt large-scale ViL models to the video domain.
   - **Implementation**: Update only the lightweight adapters injected into the frozen CLIP Transformer sub-layers.
3. **End-to-End Model Architecture**:
   - **One-stage TAD Model**: Contains a learnable dual-localization module and a zero-shot proposal classification module.
   - **Process**:
     - **Frame Embeddings**: Obtain embeddings of intermediate RGB frames through the CLIP visual encoder.
     - **Temporal Modeling**: Apply a temporal Transformer to model the frame embeddings.
     - **Semantic Representation**: Multiply the frame embeddings with the text embeddings to generate semantic embeddings for each frame.
     - **Dynamic Foreground Mask**: Generate a dynamic foreground mask over the semantic embeddings that fall within the action boundaries.
     - **Classification**: Aggregate the selected semantic embeddings and pick the category with the highest matching score.

### Experimental Results

**Datasets and Metrics**:

- **Datasets**: THUMOS14 and ActivityNet-1.3.
- **Evaluation Metrics**: Mean Average Precision (mAP) at different IoU thresholds.

**Main Results**:

- ZEETAD performs significantly better than existing zero-shot TAD methods on THUMOS14 and ActivityNet-1.3.
- See Table 1 for specific values. ZEETAD achieves strong mAP at multiple IoU thresholds.
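
The mAP metric above counts a predicted segment as correct only when its temporal IoU with a ground-truth segment of the same class reaches the chosen threshold. A minimal sketch of that overlap test follows. The function names and the threshold set (0.3 to 0.7, a common choice for THUMOS14) are illustrative assumptions, not values taken from the paper, and a full mAP computation would additionally rank predictions by confidence and average precision over classes.

```python
def temporal_iou(pred, gt):
    """IoU of two 1-D segments, each given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def matches_at_thresholds(pred, gt, thresholds=(0.3, 0.4, 0.5, 0.6, 0.7)):
    """For each threshold, report whether the prediction overlaps enough to count."""
    iou = temporal_iou(pred, gt)
    return {t: iou >= t for t in thresholds}


# Hypothetical example: prediction 2s-8s against ground truth 4s-10s.
# Intersection is 4s and union is 8s, so the IoU is 0.5.
print(temporal_iou((2.0, 8.0), (4.0, 10.0)))           # 0.5
print(matches_at_thresholds((2.0, 8.0), (4.0, 10.0)))  # True up to 0.5, False above
```

Because stricter thresholds demand tighter boundaries, reporting mAP at several thresholds separates localization quality from classification quality.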
### Summary

The paper addresses key problems in zero-shot temporal action detection with the ZEETAD model, in particular performing action localization and classification simultaneously in open-set scenarios.
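
To make the classification process from the technical details concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that the per-frame semantic embeddings are cosine similarities between frame and class-text embeddings, and that "aggregation" is a mask-weighted mean. The names `classify_proposal` and `fg_mask` are hypothetical, and the one-hot vectors merely stand in for real CLIP features.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def classify_proposal(frame_emb, text_emb, fg_mask):
    """
    frame_emb: (T, D) per-frame embeddings (stand-in for CLIP image features)
    text_emb:  (C, D) per-class embeddings (stand-in for CLIP text features)
    fg_mask:   (T,) weights in [0, 1], high inside the proposed action segment
    Returns the predicted class index and the (C,) aggregated scores.
    """
    # Semantic embedding of each frame: its similarity to every class prompt.
    sem = l2_normalize(frame_emb) @ l2_normalize(text_emb).T  # (T, C)
    # Aggregate only the frames selected by the foreground mask (assumed mean pooling).
    weights = fg_mask / (fg_mask.sum() + 1e-8)
    scores = weights @ sem  # (C,)
    return int(np.argmax(scores)), scores


# Toy data: 3 classes, 8-dim embeddings, 10 frames.
text_emb = np.eye(3, 8)  # one-hot "text" embeddings (real ones are dense)
frame_emb = np.vstack([
    np.tile(text_emb[0], (6, 1)),  # 6 background frames resembling class 0
    np.tile(text_emb[2], (4, 1)),  # 4 frames inside the action, resembling class 2
])
fg_mask = np.array([0.0] * 6 + [1.0] * 4)

masked_cls, _ = classify_proposal(frame_emb, text_emb, fg_mask)
unmasked_cls, _ = classify_proposal(frame_emb, text_emb, np.ones(10))
print(masked_cls, unmasked_cls)  # 2 0
```

In this toy case, pooling over all frames lets the background dominate, while restricting the pooling to the localized segment recovers the action class. This illustrates why localization and classification are treated as interdependent tasks.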