Video Grounding and Its Generalization

Xin Wang, Xiaohan Lan, Wenwu Zhu
DOI: https://doi.org/10.1145/3503161.3546971
2022-01-01
Abstract: Video grounding aims to ground a sentence query in a video by determining the start and end timestamps of the semantically matched segment. It is a fundamental vision-and-language problem that has been widely investigated by the research community, and it also has potential value in industrial applications. This tutorial will give a detailed introduction to the development and evolution of the task, point out the limitations of existing benchmarks, and extend text-based grounding to more general scenarios, in particular showing how it can guide the learning of other video-language tasks such as video question answering based on event grounding. The topic lies at the core of the scope of ACM Multimedia and is of interest to the MM audience from both academia and industry.