A Region-based Document VQA

Xinya Wu,Duo Zheng,Ruonan Wang,Jiashen Sun,Minzhen Hu,Fangxiang Feng,Xiaojie Wang,Huixing Jiang,Fan Yang
DOI: https://doi.org/10.1145/3503161.3548172
2022-01-01
Abstract:Practical Document Visual Question Answering (DocVQA) needs not only to recognize and extract the document contents, but also reason on them for answering questions. However, previous DocVQA data mainly focuses on in-line questions, where the answers could be directly extracted after locating keywords in the documents, which needs less reasoning. This paper therefore builds a large-scale dataset named Region-based Document VQA (RDVQA), which includes more practical questions for DocVQA. We then propose a novel Reason-over-In-region-Question-answering (ReIQ) model for addressing the problems. It is a pre-training-based model, where a Spatial-Token Pre-trained Model (STPM) is employed as the backbone. Two novel pre-training tasks, Masked Text Box Regression and Shuffled Triplet Reconstruction, are proposed to learn the entailment relationship between text blocks and tokens as well as contextual information, respectively. Moreover, a DocVQA State Tracking Module (DocST) is also proposed to track the DocVQA state in the fine-tuning stage. Experimental results show that our model improves the performance onRDVQA significantly, although more work should be done for practical DocVQA as shown inRDVQA.
What problem does this paper attempt to address?