Abstract: Text-based visual question answering (TextVQA) faces the significant challenge of avoiding redundant relational inference. Specifically, the large number of detected objects and optical character recognition (OCR) tokens results in rich visual relationships, and existing works take all of these relationships into account for answer prediction. However, we make three observations: (1) a single subject in an image can easily be detected as multiple objects with distinct bounding boxes (considered repetitive objects), and the associations between these repetitive objects are superfluous for answer reasoning; (2) two spatially distant OCR tokens detected in an image frequently have weak semantic dependencies for answer reasoning; and (3) the co-existence of nearby objects and tokens may indicate important visual cues for predicting answers. Rather than using all relationships for answer prediction, we aim to identify the most important connections and eliminate redundant ones. We propose a sparse spatial graph network (SSGN) that introduces a spatially aware relation pruning technique to this task. As spatial factors for relation measurement, we employ spatial distance, geometric dimension, overlap area, and DIoU. We consider three visual relationships for graph learning: object-object, OCR token-OCR token, and object-OCR token relationships. SSGN is a progressive graph learning architecture that verifies the pivotal relations first in the correlated object-token sparse graph, and then in the respective object-based and token-based sparse graphs. Experimental results on the TextVQA and ST-VQA datasets demonstrate that SSGN achieves promising performance. Visualization results further demonstrate the interpretability of our method.
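To make the idea of spatially aware pruning concrete, the following is a minimal sketch, not the SSGN implementation. It computes DIoU (IoU minus the squared center distance divided by the squared diagonal of the smallest enclosing box) for pairs of bounding boxes and keeps only the pairs whose score falls inside a band. The function names `diou` and `prune_edges`, the single-band rule, and the thresholds `low` and `high` are assumptions chosen for illustration; the abstract does not specify how the spatial factors are combined or thresholded.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), with x2 > x1 and y2 > y1


def diou(a: Box, b: Box) -> float:
    """Distance-IoU of two axis-aligned boxes: IoU - d^2 / c^2."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b

    # Overlap area (zero when the boxes do not intersect).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union if union > 0 else 0.0

    # Squared distance between the two box centers.
    d2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
        + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2

    # Squared diagonal of the smallest box enclosing both.
    c2 = (max(ax2, bx2) - min(ax1, bx1)) ** 2 \
        + (max(ay2, by2) - min(ay1, by1)) ** 2

    return iou - d2 / c2 if c2 > 0 else iou


def prune_edges(boxes: List[Box], low: float = -0.5,
                high: float = 0.9) -> List[Tuple[int, int]]:
    """Keep pairs (i, j), i < j, whose DIoU lies in [low, high].

    Hypothetical rule: scores above `high` are treated as repetitive
    detections of one subject, scores below `low` as spatially distant
    pairs; both kinds of edges are dropped.
    """
    kept = []
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if low <= diou(boxes[i], boxes[j]) <= high:
                kept.append((i, j))
    return kept


boxes = [
    (0, 0, 10, 10),         # object
    (0, 0, 10, 10),         # repeated detection of the same object
    (100, 100, 110, 110),   # far-away token
    (8, 0, 18, 10),         # nearby token, slightly overlapping
]
kept = prune_edges(boxes)
```

In this toy example, the edge between the two identical boxes and every edge to the distant box are dropped, while the edges linking each copy of the object to the nearby token survive. This mirrors the three observations in the abstract, under the assumed thresholds.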

Weakly-Supervised 3D Spatial Reasoning for Text-based Visual Question Answering

Toward 3D Spatial Reasoning for Human-like Text-based Visual Question Answering

Exploring Sparse Spatial Relation in Graph Inference for Text-Based VQA

An effective spatial relational reasoning networks for visual question answering

CLEVR3D: Compositional Language and Elementary Visual Reasoning for Question Answering in 3D Real-World Scenes

3D-Aware Visual Question Answering about Parts, Poses and Occlusions

Comprehensive Visual Question Answering on Point Clouds through Compositional Scene Manipulation

Enhancing scene‐text visual question answering with relational reasoning, attention and dynamic vocabulary integration

SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities

Two-Stage Multimodality Fusion for High-Performance Text-Based Visual Question Answering

Locate Then Generate: Bridging Vision and Language with Bounding Box for Scene-Text VQA

Generating Visual Spatial Description via Holistic 3D Scene Understanding

Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering

Improving Vision-and-Language Reasoning via Spatial Relations Modeling

Adversarial Training with OCR Modality Perturbation for Scene-Text Visual Question Answering

Weakly-Supervised 3D Visual Grounding based on Visual Linguistic Alignment

3D Weakly Supervised Semantic Segmentation with 2D Vision-Language Guidance

Space3D-Bench: Spatial 3D Question Answering Benchmark

R-VQA: Learning Visual Relation Facts with Semantic Attention for Visual Question Answering

RUArt: A Novel Text-Centered Solution for Text-Based Visual Question Answering