Cross-language Multimodal Scene Semantic Guidance and Leap Sampling for Video Captioning

Bo Sun, Yong Wu, Yijia Zhao, Zhuo Hao, Lejun Yu, Jun He
DOI: https://doi.org/10.1007/s00371-021-02309-w
2023-01-01
Abstract: In recent years, video captioning, which uses natural language to describe video content, has achieved encouraging results. However, most previous studies in this area have focused on directly decoding the video encoding and have rarely explored the role of scene semantics in caption generation, especially cross-language and multimodal scene semantics. The same video can be described in different languages, which differ in form but are inherently related. Meanwhile, despite high evaluation scores, some generated captions contain many non-entity words and fail to represent the video content. Based on this analysis, we propose a cross-language scene semantic guidance caption model. It first learns the high-level scene semantics of a video in different languages, from which multilanguage features are extracted. These features then characterize the video content and guide caption generation so that the captions converge toward the video content. In addition, we apply a leap sampling method for learning entity words in the model so as to better represent the video content. Experiments on the public MSR-VTT and VATEX datasets show that our model is effective. Finally, we establish a multilingual student classroom behavior caption dataset for an education scenario, providing a basis for research into captioning tasks in education. We also apply our model to this dataset and achieve certain results. The dataset is available online (https://github.com/BNU-Wu/Student-Class-Behavior-Dataset/tree/master).