Dense Video Captioning for Incomplete Videos

Xuan Dang, Guolong Wang, Kun Xiong, Zheng Qin
DOI: https://doi.org/10.1007/978-3-030-86383-8_53
2021-01-01
Abstract: Incomplete or partially missing videos are rarely considered in video captioning research. Previous approaches are mainly trained and evaluated on datasets of complete video clips, in which all events are fully observed. In this work, we formulate the problem of describing the content of partially missing videos. To tackle this challenge, we propose a Visual-Semantic Embedding with Context (VSEC) module that captures the missing visual content by jointly embedding the constructed contextual visual representation and the corresponding textual annotation. We further employ a transformer-based captioning network to generate complete and coherent descriptions for the incomplete video. To validate the effectiveness of our method, we construct a new dataset, ActivityNet Caption-P, based on ActivityNet Caption to imitate the incomplete video situations encountered in practice. We both train and test our method on ActivityNet Caption-P and achieve outstanding performance on most metrics.