Inferential Visual Question Generation

Chao Bi, Shuhui Wang, Zhe Xue, Shengbo Chen, Qingming Huang
DOI: https://doi.org/10.1145/3503161.3548055
2022-01-01
Abstract: The task of Visual Question Generation (VQG) aims to generate natural language questions for images. Many methods regard it as a reverse Visual Question Answering (VQA) task: they train a data-driven generator on VQA datasets, which makes it hard to obtain questions that can challenge robots and humans. Other methods rely heavily on elaborate but expensive manual preprocessing to generate questions. To overcome these limitations, we propose a method that generates inferential questions from an image with noisy captions. Our method first introduces a core scene graph generation module, which aligns text features and salient visual features to the initial scene graph. It constructs a special core scene graph by expanding linkage outwards from the high-confidence nodes hop by hop. Next, a question generation module uses the core scene graph as a basis to instantiate the function templates, resulting in questions with varying inferential paths. Experiments show that the visual questions generated by our method are controllable in both content and difficulty, and demonstrate clear inferential properties. In addition, since the salient region, captions, and function templates can be replaced by human-customized ones, our method has strong scalability and potential for more interactive applications. Finally, we use our method to automatically build a new dataset, InVQA, containing about 120k images and 480k question-answer pairs, to facilitate the development of more versatile VQA models.
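The two stages named in the abstract (hop-by-hop expansion from high-confidence nodes, then template instantiation along paths of different lengths) can be illustrated with a toy sketch. This is not the authors' implementation: the text-visual alignment step is omitted, and the confidence scores, the `threshold` and `max_hops` parameters, the function names, and the two question templates are all hypothetical choices made for illustration.

```python
from collections import deque


def core_scene_graph(edges, confidence, threshold=0.8, max_hops=2):
    """Toy hop-by-hop expansion (breadth-first) from high-confidence nodes.

    edges: list of (subject, relation, object) triples.
    confidence: dict mapping node name -> score in [0, 1] (assumed given).
    Returns (hop distance of each kept node, triples among kept nodes).
    """
    # Treat the graph as undirected when expanding outwards.
    neighbours = {}
    for subj, _, obj in edges:
        neighbours.setdefault(subj, set()).add(obj)
        neighbours.setdefault(obj, set()).add(subj)

    # Seed nodes are those whose confidence reaches the threshold.
    hops = {n: 0 for n, c in confidence.items() if c >= threshold}
    frontier = deque(sorted(hops))
    while frontier:
        node = frontier.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the hop budget
        for nxt in sorted(neighbours.get(node, ())):
            if nxt not in hops:
                hops[nxt] = hops[node] + 1
                frontier.append(nxt)

    kept = [(s, r, o) for s, r, o in edges if s in hops and o in hops]
    return hops, kept


def generate_questions(kept):
    """Fill two hypothetical templates: 1-hop and 2-hop inferential paths.

    Returns a list of (path_length, question, answer) tuples.
    """
    questions = [(1, f"The {s} is {r} what?", o) for s, r, o in kept]
    for s, r1, mid in kept:
        for m2, r2, o in kept:
            if m2 == mid and o != s:
                q = f"The thing that the {s} is {r1} is {r2} what?"
                questions.append((2, q, o))
    return questions


# Hypothetical scene: only "man" is a high-confidence node.
edges = [
    ("man", "holding", "umbrella"),
    ("umbrella", "above", "dog"),
    ("dog", "near", "bench"),
    ("bench", "beside", "tree"),
]
confidence = {"man": 0.95, "umbrella": 0.6, "dog": 0.5, "bench": 0.4, "tree": 0.3}

hops, kept = core_scene_graph(edges, confidence)
for length, question, answer in generate_questions(kept):
    print(length, question, answer)
```

In this sketch the path length stands in for question difficulty: a longer chain of triples requires more inference steps to answer, and changing `max_hops` or the seed nodes changes which questions can be produced.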