Cross-Graph Transformer Network for Temporal Sentence Grounding

Jiahui Shang, Ping Wei, Nanning Zheng
DOI: https://doi.org/10.1007/978-3-031-44223-0_28
2023-01-01
Abstract: Temporal sentence grounding aims to retrieve the moments in untrimmed videos that correspond to given sentences. It is a multi-modal problem that requires an adequate understanding of sentence and video structure as well as accurate interaction between the two modalities. In this paper, we propose a cross-graph Transformer network (CGTN) to address this problem, in which the sentence is modeled as a dependency tree and the video as a graph, reflecting their non-linear structures. Based on these graph structures, we design self-graph attention and cross-graph attention to model the relationships between nodes within each graph and across the two graphs. We evaluate the proposed model on two challenging datasets. Extensive experiments demonstrate the strength of our method.
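The two attention types named in the abstract can be illustrated with a minimal sketch. This is not the authors' CGTN implementation: it assumes plain scaled dot-product attention without learned projections, assumes self-graph attention is restricted to graph neighbors (plus the node itself), and assumes cross-graph attention lets every node of one graph attend to all nodes of the other. All function names, the toy features, and the edge lists are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attend(queries, keys, values, mask=None):
    """Scaled dot-product attention (no learned weights; an assumption).

    mask[i][j] == True lets query i attend to key j; None means no restriction.
    """
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    out = []
    for i, q in enumerate(queries):
        allowed = [j for j in range(len(keys)) if mask is None or mask[i][j]]
        weights = softmax([dot(q, keys[j]) / scale for j in allowed])
        out.append([sum(w * values[j][d] for w, j in zip(weights, allowed))
                    for d in range(dim)])
    return out


def adjacency_mask(num_nodes, edges):
    """Boolean mask from undirected edges, with self-loops on every node."""
    mask = [[i == j for j in range(num_nodes)] for i in range(num_nodes)]
    for a, b in edges:
        mask[a][b] = mask[b][a] = True
    return mask


def self_graph_attention(nodes, edges):
    """Each node attends only to itself and its neighbors in the same graph."""
    return attend(nodes, nodes, nodes, adjacency_mask(len(nodes), edges))


def cross_graph_attention(nodes_a, nodes_b):
    """Each node of graph A attends to every node of graph B."""
    return attend(nodes_a, nodes_b, nodes_b)


# Hypothetical toy inputs: 3 word nodes in a dependency tree, 4 clip nodes in a chain.
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tree_edges = [(0, 1), (0, 2)]
clips = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0], [0.2, 0.8]]
clip_edges = [(0, 1), (1, 2), (2, 3)]

words_ctx = self_graph_attention(words, tree_edges)
clips_ctx = self_graph_attention(clips, clip_edges)
words_fused = cross_graph_attention(words_ctx, clips_ctx)  # sentence queries video
clips_fused = cross_graph_attention(clips_ctx, words_ctx)  # video queries sentence
```

In this sketch the graph structure enters only through the adjacency mask, so swapping in a different sentence parse or video graph changes which nodes exchange information without touching the attention code itself.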