Towards Bridging Video and Language by Caption Generation and Sentence Localization.

Shaoxiang Chen
DOI: https://doi.org/10.1145/3474085.3481032
2021-01-01
Abstract: Various video understanding tasks (classification, tracking, action detection, etc.) have been extensively studied in the multimedia and computer vision communities in recent years. While these tasks are important, I think that bridging video and language is a more natural and intuitive way to interact with videos. Caption generation and sentence localization are two representative tasks for connecting video and language, and my research focuses on them. In this extended abstract, I present approaches for tackling each of these tasks by exploiting fine-grained information in videos, together with ideas about how the two tasks can be connected. So far, my work has demonstrated that these two tasks share a common foundation, and that by connecting them to form a cycle, video and language can be more closely bridged. Finally, several challenges and future directions are discussed.