Compositional Temporal Grounding with Structured Variational Cross-Graph Correspondence Learning

Juncheng Li, Junlin Xie, Long Qian, Linchao Zhu, Siliang Tang, Fei Wu, Yi Yang, Yueting Zhuang, Xin Eric Wang
DOI: https://doi.org/10.1109/cvpr52688.2022.00304
2022-01-01
Abstract: Temporal grounding in videos aims to localize one target video segment that semantically corresponds to a given query sentence. Thanks to the semantic diversity of natural language descriptions, temporal grounding allows activity grounding beyond pre-defined classes and has received increasing attention in recent years. The semantic diversity is rooted in the principle of compositionality in linguistics, where novel semantics can be systematically described by combining known words in novel ways (compositional generalization). However, current temporal grounding datasets do not specifically test for compositional generalizability. To systematically measure the compositional generalizability of temporal grounding models, we introduce a new Compositional Temporal Grounding task and construct two new dataset splits, i.e., Charades-CG and ActivityNet-CG. Evaluating the state-of-the-art methods on our new dataset splits, we empirically find that they fail to generalize to queries with novel combinations of seen words. To tackle this challenge, we propose a variational cross-graph reasoning framework that explicitly decomposes video and language into multiple structured hierarchies and learns fine-grained semantic correspondence among them. Experiments illustrate the superior compositional generalizability of our approach. The repository of this work is at https://github.com/YYJMJC/Compositional-Temporal-Grounding.