All up to You: Controllable Video Captioning with a Masked Scene Graph

Zhen Yang, Lin Shang
DOI: https://doi.org/10.1007/978-3-031-20868-3_24
2022-01-01
Abstract: Controllable video captioning generates video descriptions that follow designated control signals. However, most controllable video captioning models focus exclusively on either the contents of interest or the descriptive syntax. In this paper, we propose to guide video caption generation with a Masked Scene Graph (MSG). Formally, the model is given a video and an MSG, which not only contains semantic content nodes but also implies the syntactic form through its graph structure. The MSG can be constructed manually or modified from the original scene graph of a sampled frame. Because motion information is hard to capture in a frame-level scene graph, we mask the relationship nodes to obtain the MSG. To exploit the MSG, we propose an MSG encoder and adopt a masked autoregressive decoding algorithm, so that the model can recognize the semantic and syntactic information in the graph structure. Extensive experiments demonstrate that our framework achieves better performance and controllability than several strong baselines on the MSVD and MSR-VTT benchmarks.
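As an illustration of the masking step described in the abstract, the following is a minimal sketch of how relationship nodes in a frame scene graph might be masked while the graph structure is kept. The `Node` and `SceneGraph` types, the `[MASK]` token, the node kinds, and the example graph are all assumptions made for this sketch; the abstract does not specify the authors' data structures or implementation.

```python
from dataclasses import dataclass, replace

MASK = "[MASK]"  # hypothetical placeholder token, not taken from the paper


@dataclass(frozen=True)
class Node:
    node_id: int
    kind: str   # assumed node kinds: "object", "attribute", "relationship"
    label: str


@dataclass(frozen=True)
class SceneGraph:
    nodes: tuple  # tuple of Node
    edges: tuple  # tuple of (source_id, target_id) pairs


def mask_relationships(graph: SceneGraph) -> SceneGraph:
    """Return a copy of the graph whose relationship labels are masked.

    Object and attribute nodes keep their labels, and the edges are
    unchanged, so the structure (and the syntax it implies) is preserved.
    """
    masked_nodes = tuple(
        replace(node, label=MASK) if node.kind == "relationship" else node
        for node in graph.nodes
    )
    return SceneGraph(nodes=masked_nodes, edges=graph.edges)


# Hypothetical scene graph of one sampled frame: young man -holding-> guitar.
frame_graph = SceneGraph(
    nodes=(
        Node(0, "object", "man"),
        Node(1, "relationship", "holding"),
        Node(2, "object", "guitar"),
        Node(3, "attribute", "young"),
    ),
    edges=((0, 1), (1, 2), (3, 0)),
)

msg = mask_relationships(frame_graph)
print([node.label for node in msg.nodes])
```

In this sketch the masked relationship slot is what a decoder would have to fill in from the video, while the surviving object and attribute nodes carry the user's content constraints.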