Structured Coding Based on Semantic Disambiguation for Video Captioning

Bo Sun, Yong Wu, Jun He, Lejun Yu
DOI: https://doi.org/10.2139/ssrn.4174916
2022-01-01
SSRN Electronic Journal
Abstract: In recent years, video captioning, which uses natural language to describe video content, has achieved encouraging results. However, most previous studies focused on frame-level coding, which overlooks the role of objects and their relationships within each frame. Captions are in essence inferences about, and summaries of, the objects and relationships in a video, while knowledge graphs are collections of object concepts and their interrelationships. Based on this analysis, we propose a video captioning method based on structured coding with semantic disambiguation. First, common-sense reasoning is performed over the objects detected in a video, using knowledge graphs, to construct a conceptual semantic graph of the video, and relational learning is performed with graph neural networks. Because the same pair of objects may be linked by multiple relationships, we further propose a method that uses the scene semantic information of the video to dynamically learn the most relevant relationship among the candidates, yielding a semantically disambiguated conceptual semantic graph. In addition, to improve relationship learning, the caption is parsed into a scene graph and matched against the conceptual semantic graph so that the learned relationships better fit those expressed in the caption. Finally, we conduct experiments on the MSR-VTT dataset, the ActivityNet Captions dataset, and our own dataset of student classroom behavior captions to verify the effectiveness of the model.
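The disambiguation step described in the abstract, choosing the most relevant of several candidate relationships for one object pair by using scene semantics, can be illustrated with a small sketch. The abstract does not specify the actual mechanism, so everything here is an assumption made for illustration: dot-product attention stands in for whatever scoring the authors use, and the function name `disambiguate_relation`, the relation labels, and the toy embeddings are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def disambiguate_relation(scene_vec, candidate_relations):
    """Weight candidate relations for one object pair by scene relevance.

    ASSUMPTION: dot-product attention with the scene vector as the query.
    This is an illustrative stand-in, not the method from the paper.

    candidate_relations maps a relation name to its embedding.
    Returns (best relation name, attention weights, fused embedding).
    """
    names = list(candidate_relations)
    scores = [dot(scene_vec, candidate_relations[n]) for n in names]
    weights = softmax(scores)
    dim = len(scene_vec)
    # Fused edge feature: attention-weighted sum of candidate embeddings.
    fused = [
        sum(w * candidate_relations[n][i] for w, n in zip(weights, names))
        for i in range(dim)
    ]
    best = names[max(range(len(names)), key=weights.__getitem__)]
    return best, dict(zip(names, weights)), fused


# Hypothetical toy data: one scene vector and three candidate relations
# that a knowledge graph might propose for the same pair of objects.
scene = [1.0, 0.0, 0.5]
relations = {
    "riding": [0.9, 0.1, 0.4],
    "feeding": [0.0, 1.0, 0.0],
    "standing_next_to": [0.2, 0.3, 0.1],
}

best, weights, fused = disambiguate_relation(scene, relations)
print(best, weights)
```

In a trained model the embeddings and the scoring function would be learned, and the fused vector (or the selected relation) would serve as the edge feature passed to the graph neural network.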