QAVidCap: Enhancing Video Captioning Through Question Answering Techniques

Hui Liu, Xiaojun Wan
DOI: https://doi.org/10.1145/3652583.3658061
2024-01-01
Abstract: Video captioning is the task of describing video content using natural sentences. While recent models have shown significant improvements in metrics, there are still some unresolved issues. Model-generated captions often contain factual errors and omit important details. In contrast, human-written captions excel in accurately and comprehensively describing the video content. In this work, we propose a novel method that utilizes question answering (QA) techniques to enhance video captioning models. We start by generating QA pairs from both videos and human-written captions. We propose a QA-enhanced captioning model to better leverage QA information. Finally, we employ reinforcement learning to train the model to maximize a QA reward. By incorporating QA-related techniques, our model can generate more accurate and comprehensive video captions. We conduct experiments on three datasets, namely ActivityNet Captions, YouCookII and MSR-VTT. The experimental results, ablation studies and human evaluations demonstrate the advantages of our method.
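The abstract does not specify how the QA reward is computed. As a rough illustration only, the sketch below assumes one plausible form: a QA system answers each question using only the generated caption, and the reward is the mean token-level F1 between those answers and the reference answers. The names `token_f1`, `qa_reward` and `toy_answer_fn` are hypothetical, and `toy_answer_fn` is a trivial stand-in for a trained QA model, not anything described in the paper.

```python
from collections import Counter


def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 between a predicted and a reference answer."""
    p, g = pred.lower().split(), gold.lower().split()
    if not p or not g:
        # Both empty counts as a match; exactly one empty is a miss.
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def qa_reward(caption, qa_pairs, answer_fn):
    """Mean F1 of answers derived from `caption` over (question, answer) pairs.

    Assumption: `answer_fn(question, caption)` returns an answer string.
    """
    if not qa_pairs:
        return 0.0
    scores = [token_f1(answer_fn(q, caption), a) for q, a in qa_pairs]
    return sum(scores) / len(scores)


def toy_answer_fn(question: str, caption: str) -> str:
    """Hypothetical placeholder QA model: answers with the caption's last word."""
    words = caption.split()
    return words[-1] if words else ""


pairs = [("what does the man slice?", "onion")]
good = qa_reward("a man slices an onion", pairs, toy_answer_fn)
bad = qa_reward("a man slices a tomato", pairs, toy_answer_fn)
```

Under these assumptions, a caption that preserves the facts needed to answer the questions scores higher than one containing a factual error, which is the kind of scalar signal a reinforcement learning objective could maximize.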