Enhancing the Alignment Between Target Words and Corresponding Frames for Video Captioning.

Yunbin Tu,Chang Zhou,Junjun Guo,Shengxiang Gao,Zhengtao Yu
DOI: https://doi.org/10.1016/j.patcog.2020.107702
IF: 8
2021-01-01
Pattern Recognition
Abstract:•Visual tags are introduced to bridge the gap between vision and language.•A textual-temporal attention model is devised and incorporated into the decoder to build exact alignment between target words and corresponding frames.•Extensive experiments on two well-known datasets, i.e., MSVD and MSR-VTT, demonstrate that our proposed approach achieves remarkable improvements over the state-of-the-art methods.
What problem does this paper attempt to address?