Locate before Answering: Answer Guided Question Localization for Video Question Answering

Tianwen Qian, Ran Cui, Jingjing Chen, Pai Peng, Xiaowei Guo, Yu-Gang Jiang

2023-10-12

Abstract: Video question answering (VideoQA) is an essential task in vision-language understanding and has attracted considerable research attention recently. Nevertheless, existing works mostly achieve promising performance on short videos lasting no more than 15 seconds. For VideoQA on minute-level long-term videos, those methods are likely to fail because they lack the ability to deal with the noise and redundancy caused by scene changes and multiple actions in the video. Considering that the question often remains concentrated in a short temporal range, we propose to first locate the question to a segment in the video and then infer the answer using the located segment only. Under this scheme, we propose "Locate before Answering" (LocAns), a novel approach that integrates a question locator and an answer predictor into an end-to-end model. During the training phase, the available answer label not only serves as the supervision signal of the answer predictor, but is also used to generate pseudo temporal labels for the question locator. Moreover, we design a decoupled alternative training strategy to update the two modules separately. In the experiments, LocAns achieves state-of-the-art performance on two modern long-term VideoQA datasets, NExT-QA and ActivityNet-QA, and its qualitative examples show the reliable performance of the question localization.
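The locate-then-answer scheme described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the sliding-window proposal scheme, the `locator_score` and `answer_scores` callables (stand-ins for the question locator and the answer predictor), and the `window` and `stride` parameters.

```python
from typing import Callable, List, Sequence, Tuple

Segment = Tuple[int, int]  # half-open range [start, end) over clip indices


def candidate_segments(num_clips: int, window: int, stride: int) -> List[Segment]:
    """Enumerate fixed-length sliding windows (an assumed proposal scheme)."""
    if num_clips <= window:
        return [(0, num_clips)]
    starts = list(range(0, num_clips - window + 1, stride))
    # Add a final window if the stride left the tail of the video uncovered.
    if starts[-1] + window < num_clips:
        starts.append(num_clips - window)
    return [(s, s + window) for s in starts]


def locate_then_answer(
    clips: Sequence,
    question,
    locator_score: Callable[[Sequence, object], float],
    answer_scores: Callable[[Sequence, object], Sequence[float]],
    window: int = 4,
    stride: int = 2,
) -> Tuple[Segment, int]:
    """Pick the segment the locator rates highest, then answer from it alone."""
    segments = candidate_segments(len(clips), window, stride)
    # Step 1: locate the question to a single segment of the video.
    best = max(segments, key=lambda seg: locator_score(clips[seg[0]:seg[1]], question))
    # Step 2: infer the answer using only the located segment.
    scores = answer_scores(clips[best[0]:best[1]], question)
    answer = max(range(len(scores)), key=lambda i: scores[i])
    return best, answer
```

With toy inputs such as `clips = [0, 0, 0, 0, 5, 5, 5, 5, 0, 0]`, `question = 5`, a locator that counts clips equal to the question, and an answer scorer that returns `[0.1, float(len(segment)), 0.2]`, the call returns `((4, 8), 1)`: the clips outside the located window never reach the answer predictor.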

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper primarily aims to address the issue of Video Question Answering (VideoQA) in long videos (minute-level). Most existing methods perform well when dealing with videos shorter than 15 seconds, but they struggle with long videos due to noise and redundancy caused by scene changes and multiple actions. Specifically:

1. **Video Question Answering Task (VideoQA)**: The VideoQA task aims to answer questions posed in natural language based on video content, which is an important research direction in the field of visual and language understanding.
2. **Challenges of Long Videos**: Long videos contain complex semantic information and cannot be simply processed as a whole. Therefore, directly applying methods suitable for short videos to long videos leads to redundancy and noise issues, thereby affecting the model's performance.
3. **Locate before Answering**: The paper proposes a new paradigm of "Locate before Answering," which first identifies the segments in the video relevant to the question and then performs answer inference based solely on these segments. This approach can reduce interference from irrelevant parts and improve the model's interpretability.

Through the above methods, the paper proposes a new model named LocAns, which combines a question localization module and an answer prediction module, and employs an alternating training strategy to optimize these two modules. Experimental results show that LocAns achieves state-of-the-art performance on three modern long video question answering datasets (NExT-QA, ActivityNet-QA, and AGQA).
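The text says that the answer label is used to generate pseudo temporal labels for the question locator and that the two modules are updated alternately, but it does not spell out either rule. The sketch below shows one plausible reading and rests on two assumptions: the pseudo label is the candidate segment on which the answer predictor assigns the highest probability to the ground-truth answer, and the modules alternate once per epoch. The function names `pseudo_temporal_label` and `alternating_schedule` are hypothetical.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[int, int]  # half-open range [start, end) over clip indices


def pseudo_temporal_label(
    segments: Sequence[Segment],
    answer_probs: Sequence[Sequence[float]],
    answer_label: int,
) -> Segment:
    """Return the segment where the predictor most favours the true answer.

    answer_probs[i][k] is the predictor's probability of answer k when it
    sees only segments[i]. Assumed selection rule, not the paper's.
    """
    best_idx = max(range(len(segments)), key=lambda i: answer_probs[i][answer_label])
    return segments[best_idx]


def alternating_schedule(num_epochs: int) -> List[str]:
    """Name the module updated in each epoch; the other one stays frozen.

    Per-epoch alternation is an assumption; the granularity is not given.
    """
    return [
        "answer_predictor" if epoch % 2 == 0 else "question_locator"
        for epoch in range(num_epochs)
    ]
```

For example, with `segments = [(0, 4), (4, 8)]`, `answer_probs = [[0.7, 0.3], [0.2, 0.8]]` and `answer_label = 1`, the pseudo label is `(4, 8)`, which would then serve as the locator's training target while the answer predictor is held fixed.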

