Abstract: Text-based Visual Question Answering (TextVQA) aims to produce correct answers to questions about images that contain multiple scene texts. In most cases, the texts are naturally attached to the surfaces of objects. Therefore, spatial reasoning between texts and objects is crucial in TextVQA. However, existing approaches are constrained to 2D spatial information learned from the input images and rely on transformer-based architectures to reason implicitly during the fusion process. Under this setting, these 2D spatial reasoning approaches cannot distinguish the fine-grained spatial relations between visual objects and scene texts on the same image plane, thereby impairing the interpretability and performance of TextVQA models. In this paper, we introduce 3D geometric information into a human-like spatial reasoning process to capture the contextual knowledge of key objects step by step. Specifically, to enhance the model's understanding of 3D spatial relationships, (i) we propose a relation prediction module for accurately locating the region of interest of critical objects; (ii) we design a depth-aware attention calibration module for calibrating the OCR tokens' attention according to critical objects. Extensive experiments show that our method achieves state-of-the-art performance on the TextVQA and ST-VQA datasets. More encouragingly, our model surpasses others by clear margins of 5.7% and 12.1% on questions that involve spatial reasoning in the TextVQA and ST-VQA validation splits. In addition, we verify the generalizability of our model on the text-based image captioning task.

Towards Reasoning Ability in Scene Text Visual Question Answering

Maintaining Reasoning Consistency in Compositional Visual Question Answering

Overcoming Language Priors in VQA via Decomposed Linguistic Representations

Beyond OCR + VQA: Towards End-to-End Reading and Reasoning for Robust and Accurate TextVQA

Convincing Rationales for Visual Question Answering Reasoning

Enhancing scene‐text visual question answering with relational reasoning, attention and dynamic vocabulary integration

Interpretable Visual Question Answering via Reasoning Supervision

On the General Value of Evidence, and Bilingual Scene-Text Visual Question Answering

VisQA: X-raying Vision and Language Reasoning in Transformers

Toward 3D Spatial Reasoning for Human-like Text-based Visual Question Answering

Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical Study of VCR

Weakly-Supervised 3D Spatial Reasoning for Text-based Visual Question Answering

Graph Reasoning Networks for Visual Question Answering

A Symbolic-Neural Reasoning Model for Visual Question Answering

Towards Reasoning-Aware Explainable VQA

Towards VQA Models That Can Read

Explicit Reasoning over End-to-End Neural Architectures for Visual Question Answering

Two-Stage Multimodality Fusion for High-Performance Text-Based Visual Question Answering

SceneGATE: Scene-Graph Based Co-Attention Networks for Text Visual Question Answering