From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering

Jiangtong Li, Li Niu, Liqing Zhang
DOI: https://doi.org/10.48550/arXiv.2205.14895
2022-05-30
Abstract: Video understanding has achieved great success in representation learning, such as video captioning, video object grounding, and descriptive video question answering. However, current methods still struggle with video reasoning, including evidence reasoning and commonsense reasoning. To facilitate deeper video understanding towards video reasoning, we present the task of Causal-VidQA, which includes four types of questions ranging from scene description (description) to evidence reasoning (explanation) and commonsense reasoning (prediction and counterfactual). For commonsense reasoning, we set up a two-step solution by answering the question and providing a proper reason. Through extensive experiments on existing VideoQA methods, we find that the state-of-the-art methods are strong in descriptions but weak in reasoning. We hope that Causal-VidQA can guide the research of video understanding from representation learning to deeper reasoning. The dataset and related resources are available at https://github.com/bcmi/Causal-VidQA.git.
Computer Vision and Pattern Recognition, Computation and Language, Multimedia