Towards Reasoning Ability in Scene Text Visual Question Answering

Qingqing Wang,Liqiang Xiao,Yue Lu,Yaohui Jin,Hao He
DOI: https://doi.org/10.1145/3474085.3475390
2021-01-01
Abstract:Works on scene text visual question answering (TextVQA) always emphasize the importance of reasoning questions and image contents. However, we find current TextVQA models lack reasoning ability and tend to answer questions by exploiting dataset bias and language priors. Moreover, our observations indicate that recent accuracy improvement in TextVQA is mainly contributed by stronger OCR engines, better pre-training strategies and more Transformer layers, instead of newly proposed networks. In this work, towards the reasoning ability, we 1) conduct module-wise contribution analysis to quantitatively investigate how existing works improve accuracies in TextVQA; 2) design a gradient-based explainability method to explore why TextVQA models answer what they answer and find evidence for their predictions; 3) perform qualitative experiments to visually analyze models reasoning ability and explore potential reasons behind such a poor ability.
What problem does this paper attempt to address?