Movie Fill in the Blank by Joint Learning from Video and Text with Adaptive Temporal Attention.

Jie Chen, Jie Shao, Chengkun He
DOI: https://doi.org/10.1016/j.patrec.2018.06.030
IF: 4.757
2018-01-01
Pattern Recognition Letters
Abstract: Video understanding is a challenging problem that attracts considerable research attention. Recently, a new task called movie fill-in-the-blank (MovieFIB) was proposed: given a movie clip and a description containing one blank, the goal is to predict the missing word accurately. Previous studies have made many contributions to this problem. However, some do not exploit the relationship between words and video frames, while others treat visual information as an essential element of blank-word prediction and thereby fail to distinguish the effects of the text before and after the blank. To overcome these limitations, we propose using adaptive temporal attention and fusing text information with attention. We first extract video and word features. Adaptive temporal attention is then used to update the original description, and text information is extracted from the updated description. An attention mechanism is applied to fuse this text information. Finally, we use adaptive temporal attention to predict the blank word. Extensive experiments demonstrate that our model achieves satisfactory performance. (c) 2018 Elsevier B.V. All rights reserved.
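The abstract does not give the exact form of the adaptive temporal attention, so the following is only an illustrative sketch of one common formulation, not the authors' model. It assumes dot-product scoring between a text state and per-frame features, plus a text-only "sentinel" vector that competes with the frames so the model can lean on the description when the video is uninformative. The names `adaptive_temporal_attention`, `frames`, `query`, and `sentinel` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def adaptive_temporal_attention(frames, query, sentinel):
    """Attend over frame features plus a text-only sentinel (assumed design).

    frames:   list of T frame feature vectors
    query:    text state vector used to score each candidate
    sentinel: fallback vector carrying text-only information
    Returns (context, frame_weights, sentinel_weight).
    """
    candidates = frames + [sentinel]
    weights = softmax([dot(c, query) for c in candidates])
    dim = len(query)
    # Context is the attention-weighted sum of all candidates.
    context = [
        sum(w * c[i] for w, c in zip(weights, candidates))
        for i in range(dim)
    ]
    return context, weights[:-1], weights[-1]


# Toy usage: two frames, the first aligned with the query.
context, frame_w, sentinel_w = adaptive_temporal_attention(
    frames=[[1.0, 0.0], [0.0, 1.0]],
    query=[1.0, 0.0],
    sentinel=[0.0, 0.0],
)
```

In this sketch the sentinel weight plays the "adaptive" role: the larger it is, the less the context vector depends on visual features.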