Joint Multi-Scale Information and Long-Range Dependence for Video Captioning

Zhongyi Zhai, Xiaofeng Chen, Yishuang Huang, Lingzhong Zhao, Bo Cheng, Qian He
DOI: https://doi.org/10.1007/s13735-023-00303-7
2023-01-01
International Journal of Multimedia Information Retrieval
Abstract: Deep learning methods have achieved great success in both computer vision and natural language processing, and video captioning, which builds on both fields, has attracted extensive attention. Video captioning is a challenging task that aims to present video information in natural language to make video content easier to understand. Most current research in video captioning focuses on describing the behavior of the main objects in a video, and in particular on a holistic understanding of the content. As a result, most video captioning methods ignore the characteristics of smaller objects in the video, which leads to ambiguous, imprecise, or even fundamentally wrong descriptions. This paper proposes a novel video captioning method, MSLR, which improves the accuracy of video description by extracting features of video objects at different granularities and preserving long-range temporal dependencies. Specifically, the method performs convolution operations at different scales to obtain spatial features of different granularity and then fuses them into a unified spatial representation. On this basis, a temporal extraction network is constructed using non-local blocks to preserve the long-range dependencies of videos. Experimental results on two popular benchmark datasets demonstrate the superiority of MSLR over previous state-of-the-art methods, and ablation experiments and text evaluation verify the effectiveness of the MSLR components.
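The two ideas named in the abstract, multi-scale spatial fusion and non-local temporal aggregation, can be illustrated with a minimal sketch. This is not the MSLR implementation: the abstract gives no architectural details, so everything below is an assumption made for illustration. Fixed mean filters stand in for learned convolutions, fusion is a plain average across scales, and the non-local step is a parameter-free variant (dot-product similarity, softmax weights, residual connection) with no learned projections. The names `box_filter`, `multi_scale_fuse`, `non_local`, and `alpha` are hypothetical.

```python
import math


def box_filter(grid, k):
    """Mean filter of odd size k over a 2D list, clamping indices at the border.

    Assumption: a fixed mean filter stands in for a learned k x k convolution.
    """
    h, w = len(grid), len(grid[0])
    r = k // 2
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            vals = [
                grid[min(max(i + di, 0), h - 1)][min(max(j + dj, 0), w - 1)]
                for di in range(-r, r + 1)
                for dj in range(-r, r + 1)
            ]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


def multi_scale_fuse(grid, kernel_sizes=(1, 3, 5)):
    """Filter one frame at several scales, average the results, and flatten.

    Assumption: averaging is used as the fusion step purely for simplicity.
    """
    scales = [box_filter(grid, k) for k in kernel_sizes]
    h, w = len(grid), len(grid[0])
    return [
        sum(s[i][j] for s in scales) / len(scales)
        for i in range(h)
        for j in range(w)
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def non_local(features, alpha=1.0):
    """Parameter-free non-local step over a list of T frame vectors.

    Each output frame is the input frame plus alpha times a weighted sum of
    all T frames, so distant frames can contribute directly.
    """
    out = []
    for xi in features:
        # Similarity of frame i to every frame j (dot product).
        weights = softmax(
            [sum(a * b for a, b in zip(xi, xj)) for xj in features]
        )
        # Weighted sum of all frames, one dimension at a time.
        yi = [
            sum(wt * xj[d] for wt, xj in zip(weights, features))
            for d in range(len(xi))
        ]
        # Residual connection keeps the original frame features.
        out.append([a + alpha * b for a, b in zip(xi, yi)])
    return out


# Toy input: 6 frames, each a 4 x 4 single-channel feature map.
frames = [
    [[float(t + i + j) for j in range(4)] for i in range(4)]
    for t in range(6)
]
fused = [multi_scale_fuse(f) for f in frames]   # 6 vectors of length 16
video_repr = non_local(fused)                   # same shape, temporally mixed
```

In this sketch every frame attends to every other frame in a single step, which is the property that lets non-local operations capture long-range dependencies without stacking many local temporal layers.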