Multi-attention mechanism for Chinese description of videos

Hu Liu, Junxiu Wu, Jiabin Yuan
DOI: https://doi.org/10.1145/3445815.3445845
2020-12-11
Abstract: Describing videos in natural language is an active research topic at the intersection of natural language processing and computer vision. However, most video description work generates English descriptions, and Chinese descriptions are rarely addressed. This paper explores the generation of Chinese descriptions for videos and proposes an improved video description model that combines multi-modal features with a multi-attention mechanism. The model extracts video information from both global features and fine-grained features, and uses the multi-attention mechanism to focus on the more important video information in the decoding stage, which further improves the richness and accuracy of the generated descriptions. The model is applied to the extended Chinese corpus of MSVD (Microsoft Research video description corpus), where the highest METEOR score obtained is 9.6% higher than the best previously reported result for Chinese video description on MSVD. The model also performs well compared with many state-of-the-art methods on English descriptions.
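The decoding-stage idea in the abstract, attending separately to global and fine-grained features, can be illustrated with a minimal sketch. The abstract does not specify the scoring function or how the attended contexts are fused, so the scaled dot-product scoring, the concatenation step, and all function names below (`attend`, `multi_attention_context`) are assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(query, features):
    """Soft attention (assumed scaled dot-product scoring).

    Returns the weighted sum of the feature vectors and the weights.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, f) / scale for f in features])
    dim = len(features[0])
    context = [sum(w * f[i] for w, f in zip(weights, features))
               for i in range(dim)]
    return context, weights


def multi_attention_context(decoder_state, global_feats, fine_feats):
    """Attend to each feature stream separately, then concatenate.

    Concatenation is an assumed fusion step, chosen for simplicity.
    """
    g_ctx, g_w = attend(decoder_state, global_feats)
    f_ctx, f_w = attend(decoder_state, fine_feats)
    return g_ctx + f_ctx, g_w, f_w


# Toy usage: a 2-dimensional decoder state and two small feature sets.
decoder_state = [1.0, 0.0]
global_feats = [[1.0, 0.0], [0.0, 1.0]]
fine_feats = [[0.5, 0.5], [2.0, 0.0], [0.0, 2.0]]
context, g_weights, f_weights = multi_attention_context(
    decoder_state, global_feats, fine_feats)
```

In a real decoder the fused context would be fed, together with the previous word embedding, into the recurrent unit at each time step. Keeping one attention distribution per stream lets the decoder weight frame-level and object-level evidence independently.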