Zero-Shot Temporal Action Detection by Learning Multimodal Prompts and Text-Enhanced Actionness

Asif Raza, Bang Yang, Yuexian Zou
DOI: https://doi.org/10.1109/tcsvt.2024.3414275
IF: 5.859
2024-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Zero-shot temporal action detection (ZS-TAD), which aims to recognize and detect new and unseen video actions, is an emerging and challenging task with limited solutions. Recent studies have adapted the vision-language pre-trained model CLIP to this task through parameter-efficient fine-tuning to achieve open-vocabulary detection. However, they suffer from insufficient vision-text alignment because of the dual-stream structure of CLIP, and they yield inferior TAD results because they lack an accurate action prior. In this paper, we target these limitations and propose to learn multimodal Prompts and Text-Enhanced Actionness (mProTEA) for ZS-TAD. Specifically, we insert learnable layer-wise prompts into the vision and text branches of the frozen CLIP and establish a strong coupling between them, resulting in multimodal prompts that boost cross-modal alignment. To reduce computation costs, we first conduct multimodal prompt learning on an image recognition dataset with rich concepts (e.g., ImageNet) and then keep the prompts frozen during TAD fine-tuning. To improve TAD, we introduce text-enhanced actionness modeling, in which we leverage the concise semantics of text to assist the calculation of class-agnostic actionness scores, offering accurate prior information for both action classification and localization. With these designs, mProTEA excels in extensive TAD experiments, surpassing the strong competitor STALE by 5.1% on ActivityNet under the zero-shot setting and achieving state-of-the-art performance in conventional supervised scenarios. Ablation studies confirm the effectiveness of our proposals and show superior domain generalization of multimodal prompts learned on ImageNet against the other 10 image recognition datasets.
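The abstract does not give the formula behind text-enhanced actionness, so the following is only a minimal sketch of the general idea: a class-agnostic, per-snippet score that blends a visual actionness estimate with a prior derived from text similarity. Everything here is an assumption made for illustration and not the paper's method: the function names, the use of the best cosine match over class text embeddings as the prior, the rescaling to [0, 1], and the blending weight `alpha` are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def text_enhanced_actionness(snippet_feats, text_feats, visual_scores, alpha=0.5):
    """Blend a visual actionness score with a text-similarity prior per snippet.

    Hypothetical illustration only: the blending rule and `alpha` are
    assumptions, not taken from the paper.
    """
    scores = []
    for feat, visual in zip(snippet_feats, visual_scores):
        # Class-agnostic prior: how well the snippet matches *any* class text.
        best_match = max(cosine(feat, text) for text in text_feats)
        # Map cosine similarity from [-1, 1] to [0, 1].
        text_prior = (best_match + 1.0) / 2.0
        scores.append(alpha * visual + (1.0 - alpha) * text_prior)
    return scores


# Toy usage: two snippets, one class text embedding.
snippets = [[1.0, 0.0], [0.0, 1.0]]
class_texts = [[1.0, 0.0]]
visual = [0.8, 0.2]
actionness = text_enhanced_actionness(snippets, class_texts, visual)
```

In this toy setup the first snippet aligns with the class text and receives the higher score; in a real pipeline the features would come from the CLIP vision and text encoders rather than hand-written vectors.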