Exploring Object-Centered External Knowledge for Fine-Grained Video Paragraph Captioning

Guorui Yu, Yimin Hu, Yiqian Xu, Yuejie Zhang, Rui Feng, Tao Zhang, Shang Gao
DOI: https://doi.org/10.1109/icassp48485.2024.10448104
2024-01-01
Abstract: The video paragraph captioning task aims to generate a detailed, fluent, and relevant paragraph for a given video. Prior studies often focus on isolating visual objects (potential main components in a sentence) from the overall video content. They rarely explore the latent semantic relations between objects and high-level video concepts, resulting in dull or even incorrect descriptions. To create fine-grained and contextually relevant paragraph captions, we propose a novel framework that constructs a concept graph from a commonsense knowledge base and infers richer semantic meaning from the visual objects. Moreover, we employ a Vision-Guided Concept Selection Network that incorporates an under-sentence supervision mechanism to align the external knowledge with the visual information. Extensive experiments on ActivityNet Captions and YouCook2 demonstrate the effectiveness of our method compared with state-of-the-art methods.