Video Question Answering Via Hierarchical Dual-Level Attention Network Learning.

Zhou Zhao, Jinghao Lin, Xinghua Jiang, Deng Cai, Xiaofei He, Yueting Zhuang
DOI: https://doi.org/10.1145/3123266.3123364
2017-01-01
Abstract: Video question answering is a challenging task in visual information retrieval, which aims to produce an accurate answer from the referenced video content for a given question. However, existing visual question answering approaches mainly tackle static image question answering and cannot be applied effectively to video question answering directly, because they do not sufficiently model the temporal dynamics of video. In this paper, we study the problem of video question answering from the viewpoint of hierarchical dual-level attention network learning. We obtain the object appearance and movement information in the video using both frame-level and segment-level feature representation methods. We then develop hierarchical dual-level attention networks to learn question-aware video representations with word-level and question-level attention mechanisms. We next devise a question-level fusion attention mechanism for the proposed networks to learn a question-aware joint video representation for video question answering. We construct two large-scale video question answering datasets. Extensive experiments validate the effectiveness of our method.
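The question-level attention and fusion steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' model: the abstract does not give the scoring function, so the dot-product score, the function names `attend` and `fuse`, and the toy two-dimensional features are all assumptions made for illustration. The word-level attention and every learned parameter are omitted.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(features, query):
    # Soft attention (assumed dot-product scoring): weight each feature
    # vector by its softmax-normalized similarity to the query vector,
    # then return the weighted sum together with the weights.
    weights = softmax([dot(f, query) for f in features])
    dim = len(features[0])
    pooled = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return pooled, weights

def fuse(frame_repr, segment_repr, query):
    # Hypothetical fusion step: attend over the two level-specific
    # representations to obtain one joint video representation.
    return attend([frame_repr, segment_repr], query)

# Toy inputs (made up): three frame features, two segment features,
# and one question embedding.
frames = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
segments = [[0.2, 0.8], [0.9, 0.1]]
question = [2.0, 0.0]

frame_repr, frame_w = attend(frames, question)        # frame-level (appearance)
segment_repr, segment_w = attend(segments, question)  # segment-level (movement)
joint, fusion_w = fuse(frame_repr, segment_repr, question)
```

In this sketch, frames and segments that align more closely with the question embedding receive larger weights, and the fusion step applies the same mechanism to the two pooled representations. A trained model would replace the raw dot product with learned projections.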