Grounding-Prompter: Prompting LLM with Multimodal Information for Temporal Sentence Grounding in Long Videos

Houlun Chen, Xin Wang, Hong Chen, Zihan Song, Jia, Wenwu Zhu
DOI: https://doi.org/10.48550/arxiv.2312.17117
2023-01-01
Abstract: Temporal Sentence Grounding (TSG), which aims to localize moments from videos based on the given natural language queries, has attracted widespread attention. Existing works are mainly designed for short videos, failing to handle TSG in long videos, which poses two challenges: i) complicated contexts in long videos require temporal reasoning over longer moment sequences, and ii) multiple modalities including textual speech with rich information require special designs for content understanding in long videos. To tackle these challenges, in this work we propose a Grounding-Prompter method, which is capable of conducting TSG in long videos through prompting LLM with multimodal information. In detail, we first transform the TSG task and its multimodal inputs including speech and visual, into compressed task textualization. Furthermore, to enhance temporal reasoning under complicated contexts, a Boundary-Perceptive Prompting strategy is proposed, which contains three folds: i) we design a novel Multiscale Denoising Chain-of-Thought (CoT) to combine global and local semantics with noise filtering step by step, ii) we set up validity principles capable of constraining LLM to generate reasonable predictions following specific formats, and iii) we introduce one-shot In-Context-Learning (ICL) to boost reasoning through imitation, enhancing LLM in TSG task understanding. Experiments demonstrate the state-of-the-art performance of our Grounding-Prompter method, revealing the benefits of prompting LLM with multimodal information for TSG in long videos.
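To make the pipeline in the abstract more concrete, here is a minimal Python sketch of two of its ingredients: turning timestamped speech and visual descriptions into text, and checking that an LLM reply obeys a fixed output format. It is an illustration under stated assumptions, not the authors' implementation. The `Segment` structure, the prompt wording, the `start-end` answer format, and all function names are hypothetical, and the Multiscale Denoising CoT step is not modeled.

```python
import re
from dataclasses import dataclass


@dataclass
class Segment:
    """One timestamped piece of textualized video content (hypothetical structure)."""
    start: float  # seconds
    end: float    # seconds
    speech: str   # e.g. subtitle or ASR text for this span
    visual: str   # e.g. a caption of a sampled frame


def textualize(segments):
    """Render segments as compact timestamped lines that an LLM can read."""
    return "\n".join(
        f"[{s.start:.0f}s-{s.end:.0f}s] speech: {s.speech} | visual: {s.visual}"
        for s in segments
    )


def build_prompt(segments, query, duration, example):
    """Join task description, format rule, one-shot example, video text, and query."""
    rule = (
        "Answer with exactly one line of the form 'start-end' in seconds, "
        f"with 0 <= start < end <= {duration:.0f}."
    )
    return "\n\n".join([
        "Task: find the time span in the video that matches the query.",
        rule,
        "Example:\n" + example,
        "Video:\n" + textualize(segments),
        "Query: " + query,
    ])


def parse_prediction(reply, duration):
    """Return (start, end) if the reply follows the assumed format rule, else None."""
    m = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*", reply)
    if m is None:
        return None
    start, end = float(m.group(1)), float(m.group(2))
    if 0 <= start < end <= duration:
        return (start, end)
    return None
```

With these definitions, `parse_prediction("30-60", 60)` returns `(30.0, 60.0)`, while a reversed span such as `"60-30"` or a reply with extra words returns `None`, so a caller could re-prompt when the format rule is violated.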