Exploring Union and Intersection of Visual Regions for Generating Questions, Answers, and Distractors

Wenjian Ding,Yao Zhang,Jun Wang,Adam Jatowt,Zhenglu Yang
DOI: https://doi.org/10.18653/v1/2024.emnlp-main.88
2024-01-01
Abstract:Multiple-choice visual question answering (VQA) is to automatically choose a correct answer from a set of choices after reading an image. Existing efforts have been devoted to a separate generation of an image-related question, a correct answer, or challenge distractors. By contrast, we turn to a holistic generation and optimization of questions, answers, and distractors (QADs) in this study. This integrated generation strategy eliminates the need for human curation and guarantees information consistency. Furthermore, we first propose to put the spotlight on different image regions to diversify QADs. Accordingly, a novel framework ReBo is formulated in this paper. ReBo cyclically generates each QAD based on a recurrent multimodal encoder, and each generation is focusing on a different area of the image compared to those already concerned by the previously generated QADs. In addition to traditional VQA comparisons with state-of-the-art approaches, we also validate the capability of ReBo in generating augmented data to benefit VQA models.
What problem does this paper attempt to address?