Abstract: Existing efforts in text-based video question answering (TextVideoQA) are criticized for their opaque decision-making and heavy reliance on scene-text recognition. In this paper, we propose to study Grounded TextVideoQA by forcing models to answer questions and spatio-temporally localize the relevant scene-text regions, thus decoupling QA from scene-text recognition and promoting research towards interpretable QA. The task has three-fold significance. First, it encourages models to base answer predictions on scene-text evidence rather than on other shortcuts. Second, it directly accepts scene-text regions as visual answers, thus circumventing the problem of ineffective answer evaluation by stringent string matching. Third, it isolates the challenges inherent in VideoQA and scene-text recognition. This enables diagnosis of the root causes of failed predictions, e.g., whether the error lies in QA or in scene-text recognition. To achieve Grounded TextVideoQA, we propose the T2S-QA model, which features a disentangled temporal-to-spatial contrastive learning strategy for weakly-supervised scene-text grounding and grounded TextVideoQA. To facilitate evaluation, we construct a new dataset, ViTXT-GQA, which features 52K scene-text bounding boxes within 2.2K temporal segments related to 2K questions and 729 videos. With ViTXT-GQA, we perform extensive experiments and demonstrate the severe limitations of existing techniques in Grounded TextVideoQA. While T2S-QA achieves superior results, the large gap to human performance leaves ample room for improvement. Our further analysis with oracle scene-text inputs suggests that the major challenge is scene-text recognition. To advance research on Grounded TextVideoQA, our dataset and code are available at https://github.com/zhousheng97/ViTXT-GQA.git

ICDAR 2023 Competition on Born Digital Video Text Question Answering

Dual Path Multi-Modal High-Order Features for Textual Content Based Visual Question Answering

ICDAR 2023 Video Text Reading Competition for Dense and Small Text

Video Question Answering: Datasets, Algorithms and Challenges

The Solution for the ICCV 2023 Perception Test Challenge 2023 -- Task 6 -- Grounded videoQA

BDIQA: A New Dataset for Video Question Answering to Explore Cognitive Reasoning through Theory of Mind

VQA$^2$: Visual Question Answering for Video Quality Assessment

ICDAR 2021 Competition on Scene Video Text Spotting

ICDAR 2021 Competition on Document Visual Question Answering

TVQA: Localized, Compositional Video Question Answering

VTQA: Visual Text Question Answering via Entity Alignment and Cross-Media Reasoning

Scene-Text Grounding for Text-Based Video Question Answering

A Bilingual, OpenWorld Video Text Dataset and End-to-end Video Text Spotter with Transformer

TG-VQA: Ternary Game of Video Question Answering

ICDAR 2023 Competition on Structured Text Extraction from Visually-Rich Document Images

DSText V2: A Comprehensive Video Text Spotting Dataset for Dense and Small Text

TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering

Task-driven Visual Saliency and Attention-based Visual Question Answering

DuReadervis: A Chinese Dataset for Open-domain Document Visual Question Answering

First Place Solution to the Multiple-choice Video QA Track of The Second Perception Test Challenge

LingoQA: Visual Question Answering for Autonomous Driving