SPTNET: Span-based Prompt Tuning for Video Grounding

Yiren Zhang, Yuanwu Xu, Mohan Chen, Yuejie Zhang, Rui Feng, Shang Gao
DOI: https://doi.org/10.1109/ICME55011.2023.00477
2023-01-01
Abstract: When a Pre-trained Language Model (PLM) is adopted for the video grounding task, it usually acts only as a text encoder, and its knowledge is not fully utilized. There is also an inconsistency between the pre-training and downstream objectives. To address these issues, we propose a new paradigm named Span-based Prompt Tuning (SPTNet), which converts the video grounding task into a cloze form. Specifically, a template first rewrites the query into a form containing a mask token; the video and query embeddings are then integrated through a cross-modal transformer. The start and end points of the time span matching the query are predicted from the embedding of the mask token. Experimental results on two public benchmarks, ActivityNet Captions and Charades-STA, show that SPTNet outperforms state-of-the-art methods.
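The cloze conversion and span prediction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the template wording, the two sigmoid-activated linear heads, the embedding size, and the names `to_cloze` and `predict_span` are all assumptions made for illustration, and a random vector stands in for the mask-token embedding that the cross-modal transformer would produce.

```python
import math
import random

MASK = "[MASK]"

def to_cloze(query: str, template: str = "{query} The moment is {mask}.") -> str:
    """Wrap a query in a template containing a mask token.

    The template text is hypothetical; the abstract does not specify it.
    """
    return template.format(query=query.strip(), mask=MASK)

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def predict_span(mask_embedding, w_start, w_end, duration):
    """Map the mask-token embedding to (start, end) in seconds.

    Assumed design: two linear heads followed by a sigmoid give normalized
    boundaries in [0, 1], which are ordered and scaled by the video duration.
    """
    s = sigmoid(sum(a * b for a, b in zip(mask_embedding, w_start)))
    e = sigmoid(sum(a * b for a, b in zip(mask_embedding, w_end)))
    start, end = min(s, e), max(s, e)
    return start * duration, end * duration

# Step 1: rewrite the query into cloze form.
prompt = to_cloze("A person opens the door.")

# Step 2 (stand-in): a real model would encode the prompt and the video with a
# cross-modal transformer and read out the embedding at the mask position.
rng = random.Random(0)
dim = 8
mask_embedding = [rng.uniform(-1.0, 1.0) for _ in range(dim)]

# Step 3: predict the time span from the mask-token embedding.
w_start = [rng.uniform(-1.0, 1.0) for _ in range(dim)]  # untrained, illustrative weights
w_end = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
start, end = predict_span(mask_embedding, w_start, w_end, duration=30.0)

print(prompt)
print(f"predicted span: {start:.2f}s to {end:.2f}s")
```

With untrained weights the predicted span is arbitrary; the sketch only shows how the mask position can serve as the single readout for both boundaries.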