Remember and forget: video and text fusion for video question answering

Feng Gao,Yuanyuan Ge,Yongge Liu
DOI: https://doi.org/10.1007/s11042-018-5868-x
IF: 2.577
2018-03-27
Multimedia Tools and Applications
Abstract:Video question answering (Video QA) has received much attention in recent years. It can answer questions according to the visual content of a video clip. Video QA task can be solved only according to the video data. But if the video clip has some relevant text information, It can also be solved by using the fused video and text data. How to select the useful region features from the video frames and select the useful text features from the text information needs to be solved. And how to fuse the video and text features also needs to be solved. Therefore, we propose a forget memory network to solve these problems. The forget memory network with video framework can solve Video QA task only according to the video data. It can select the useful region features for the question and forget the irrelevant region features from the video frames. The forget memory network with video and text framework can extract the useful text features and forget the irrelevant text features for the question. And it can fuse the video and text data to solve Video QA task. The fused video and text features can help improve the experimental performance.
computer science, information systems, theory & methods,engineering, electrical & electronic, software engineering
What problem does this paper attempt to address?