Video Question Generation for Dynamic Changes
Jiayuan Xie, Jiali Chen, Zhenghao Liu, Yi Cai, Qingbao Huang, Qing Li
DOI: https://doi.org/10.1109/tcsvt.2024.3391415
2024-01-01
Abstract: The video question generation task aims to generate meaningful questions about a video that target a given answer. Existing methods focus only on the static appearance features of image frames, or simply identify a motion in the video and ask general questions. However, a video contains dynamically changing visual content that deserves to be questioned, e.g., changes in object motions, object states, and relationships among objects; questions about such changes are more practical and closer to the dynamic world we live in. In this paper, we propose a difference-aware video question generation model that generates questions about temporal differences in a video, i.e., it captures the dynamic changes between image frames and asks questions about them. To capture these changes, we use a temporal difference extractor that localizes the differences for each frame pair of a video through an attention mechanism. We then introduce an answer-aware module that selects the answer-related frame pairs together with their differences, guiding the model to focus on answer-related content when questioning. Finally, the output of the answer-aware module is sent to a decoder module to generate questions. Extensive experiments on the SVQA and MSVD-QA datasets show that the proposed model outperforms state-of-the-art models, e.g., our model achieves at least a 17.1% improvement over existing models on the SVQA dataset. We attribute this to the model's ability to generate questions that, like the ground truths, involve changes between image frames in videos. Our code is available at https://github.com/Gary-code/D-VQG.
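The pipeline described in the abstract (frame-pair differences, then answer-aware selection, then decoding) can be illustrated with a toy sketch. This is not the authors' implementation: plain element-wise subtraction stands in for the attention-based temporal difference extractor, a dot-product softmax stands in for the answer-aware module, the decoder is omitted, and all function names and feature vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def frame_pair_differences(frames):
    """Difference vector for each consecutive frame pair.

    Assumption: simple subtraction replaces the paper's attention-based
    difference localization.
    """
    return [[b - a for a, b in zip(f0, f1)]
            for f0, f1 in zip(frames, frames[1:])]


def answer_aware_pooling(differences, answer):
    """Weight each frame-pair difference by its similarity to the answer.

    Assumption: a dot-product softmax replaces the paper's answer-aware module.
    Returns the attention weights and the weighted sum of the differences.
    """
    weights = softmax([dot(d, answer) for d in differences])
    dim = len(answer)
    pooled = [sum(w * d[i] for w, d in zip(weights, differences))
              for i in range(dim)]
    return weights, pooled


# Hypothetical 2-D frame features: nothing changes between frames 0 and 1,
# then the content changes between frames 1 and 2.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
answer = [0.0, 1.0]  # hypothetical answer embedding

diffs = frame_pair_differences(frames)
weights, pooled = answer_aware_pooling(diffs, answer)
```

In this toy input the second frame pair carries the only change, so it receives the larger attention weight; in the described model, the pooled representation would be what is passed on to the question decoder.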