Video Description Method Based on Multidimensional and Multimodal Information

DING Enjie, LIU Zhongyu, LIU Yafeng, YU Wanli
DOI: https://doi.org/10.11959/j.issn.1000-436x.2020037
2020-01-01
Abstract: To address the problem of complex information representation in automatic video description, a multi-dimensional and multi-modal visual feature extraction and fusion method is proposed. First, multi-dimensional features, such as the static and dynamic attributes of the video sequence, are extracted by transfer learning, and an image description algorithm is used to extract semantic information from the key frames of the video; together these make up the video features. Then, multi-layer long short-term memory (LSTM) networks fuse the multi-dimensional and multi-modal information and generate a natural-language description of the video content. Experimental results show that the proposed method achieves better results than existing methods on the automatic video description task.
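The fusion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes per-frame static, dynamic, and semantic feature vectors are already available as plain lists, fuses them by simple concatenation (an assumption, since the abstract does not specify the fusion operator), and passes the result through a stack of untrained LSTM layers. All names (`fuse_frame_features`, `LSTMLayer`, `encode_video`) and sizes are hypothetical, and the language decoder is omitted.

```python
import math
import random


def fuse_frame_features(static, dynamic, semantic):
    """Concatenate per-frame vectors from three sources (assumed fusion rule)."""
    if not (len(static) == len(dynamic) == len(semantic)):
        raise ValueError("all modalities must cover the same number of frames")
    return [s + d + m for s, d, m in zip(static, dynamic, semantic)]


def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class LSTMLayer:
    """One LSTM layer with randomly initialised, untrained weights."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        cols = input_size + hidden_size
        # One weight matrix and bias per gate:
        # i = input, f = forget, g = cell candidate, o = output.
        self.weights = {
            gate: [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                   for _ in range(hidden_size)]
            for gate in "ifgo"
        }
        self.bias = {gate: [0.0] * hidden_size for gate in "ifgo"}

    def _gate(self, gate, x_h):
        # Affine transform of the concatenated [input, previous hidden] vector.
        return [
            sum(w * v for w, v in zip(row, x_h)) + b
            for row, b in zip(self.weights[gate], self.bias[gate])
        ]

    def forward(self, sequence):
        """Return the hidden state at every time step."""
        h = [0.0] * self.hidden_size
        c = [0.0] * self.hidden_size
        outputs = []
        for x in sequence:
            x_h = x + h
            i = [_sigmoid(v) for v in self._gate("i", x_h)]
            f = [_sigmoid(v) for v in self._gate("f", x_h)]
            g = [math.tanh(v) for v in self._gate("g", x_h)]
            o = [_sigmoid(v) for v in self._gate("o", x_h)]
            c = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
            h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
            outputs.append(h)
        return outputs


def encode_video(static, dynamic, semantic, hidden_size=4, num_layers=2, seed=0):
    """Fuse the modalities, then run the sequence through stacked LSTM layers."""
    rng = random.Random(seed)
    sequence = fuse_frame_features(static, dynamic, semantic)
    input_size = len(sequence[0])
    for _ in range(num_layers):
        layer = LSTMLayer(input_size, hidden_size, rng)
        sequence = layer.forward(sequence)
        input_size = hidden_size
    return sequence
```

In a real system the three inputs would come from pretrained networks (for example an image classifier, a motion model, and an image captioner), and the final hidden states would feed a decoder that emits words; here they are placeholders that only show how the pieces connect.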