Progressive Graph Attention Network for Video Question Answering
Liang Peng, Shuangji Yang, Yi Bin, Guoqing Wang
DOI: https://doi.org/10.1145/3474085.3475193
2021-01-01
Abstract: Video question answering (Video-QA) is the task of answering a natural language question about the content of a video. Existing methods generally explore only a single kind of interaction, either between objects or between frames, which is insufficient for the sophisticated scenes found in videos. To tackle this problem, we propose a novel model, termed Progressive Graph Attention Network (PGAT), which jointly explores multiple visual relations at the object level, frame level, and clip level. Specifically, for object-level relation encoding, we design two complementary graphs: one learns the spatial and semantic relations between objects in the same frame, and the other models the temporal relations of the same object across different frames. The frame-level graph explores the interactions between frames to capture fine-grained appearance changes, while the clip-level graph models the temporal and semantic relations between the actions in different clips. These graphs at different levels are concatenated in a progressive manner to learn visual relations from low level to high level. Furthermore, we identify, for the first time, serious answer biases in TGIF-QA, a very large Video-QA dataset, and reconstruct a new dataset based on it, called TGIF-QA-R, to overcome these biases. We evaluate the proposed model on three benchmark datasets and the new TGIF-QA-R, and the experimental results demonstrate that our model significantly outperforms other state-of-the-art models. Our code and dataset are available at https://github.com/PengLiang-cn/PGAT.
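The progressive, low-level-to-high-level idea can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes fully connected graphs, plain dot-product attention without learned weights, and mean pooling between levels, and it omits question conditioning and the temporal object graph. All function names (`graph_attention`, `mean_pool`, `progressive_encode`) are hypothetical.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def graph_attention(nodes):
    # Assumption: a fully connected graph with scaled dot-product attention
    # and no learned projections; each node becomes a weighted sum of all nodes.
    dim = len(nodes[0])
    out = []
    for q in nodes:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in nodes]
        weights = softmax(scores)
        out.append([sum(w * k[i] for w, k in zip(weights, nodes)) for i in range(dim)])
    return out

def mean_pool(nodes):
    # Collapse a set of node vectors into one vector for the next level.
    dim = len(nodes[0])
    return [sum(n[i] for n in nodes) / len(nodes) for i in range(dim)]

def progressive_encode(video):
    # video is nested as clips -> frames -> objects -> feature vector.
    clip_vecs = []
    for clip in video:
        # Object level: relate objects within each frame, pool to a frame vector.
        frame_vecs = [mean_pool(graph_attention(objs)) for objs in clip]
        # Frame level: relate frames within the clip, pool to a clip vector.
        clip_vecs.append(mean_pool(graph_attention(frame_vecs)))
    # Clip level: relate clips, pool to a single video vector.
    return mean_pool(graph_attention(clip_vecs))

# Toy input: 2 clips, 2 frames per clip, 2 objects per frame, 2-dim features.
video = [
    [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [0.0, 0.0]]],
    [[[0.5, 0.5], [0.5, 0.5]], [[1.0, 0.0], [1.0, 0.0]]],
]
video_vec = progressive_encode(video)
```

Each level consumes the pooled output of the level below it, which is the sense in which the sketch is progressive. A real model would use learned attention parameters and would fuse the question representation before predicting an answer.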