VQA Therapy: Exploring Answer Differences by Visually Grounding Answers

Chongyan Chen, Samreen Anjum, Danna Gurari
2023-08-25
Abstract: Visual question answering is a task of predicting the answer to a question about an image. Given that different people can provide different answers to a visual question, we aim to better understand why with answer groundings. We introduce the first dataset that visually grounds each unique answer to each visual question, which we call VQAAnswerTherapy. We then propose two novel problems of predicting whether a visual question has a single answer grounding and localizing all answer groundings. We benchmark modern algorithms for these novel problems to show where they succeed and struggle. The dataset and evaluation server can be found publicly at https://vizwiz.org/tasks-and-datasets/vqa-answer-therapy/.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper asks why different respondents give different answers to the same visual question in the Visual Question Answering (VQA) task, and in particular whether those answers refer to different visual content in the image. To study this, the research team created a dataset named VQA-AnswerTherapy, which annotates the visual evidence for every valid answer to each visual question. Based on this dataset, the paper introduces two new algorithmic challenges:
1. **Single Answer Grounding Challenge**: Predict whether all valid answers to a visual question describe the same visual evidence.
2. **Answer Grounding Challenge**: Locate the visual evidence corresponding to all valid answers to a visual question.
The paper benchmarks modern algorithms on both challenges to show where they succeed and where they fail. The work also explores how differences between annotators' answers affect the VQA task, with the aim of helping VQA systems handle such differences and become more interpretable and reliable. The work has direct practical value, especially for visually impaired users.
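To make the single answer grounding question concrete, the following minimal sketch checks whether the groundings of several answers cover the same region. It assumes each grounding is a binary mask stored as nested lists and compares masks by intersection over union (IoU). The names `mask_iou` and `has_single_grounding` and the 0.9 threshold are illustrative assumptions, not the paper's procedure or the dataset's API.

```python
from itertools import combinations
from typing import Dict, List

# Assumed representation: a binary mask where 1 marks a pixel in the grounding.
Mask = List[List[int]]


def mask_iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two same-sized binary masks."""
    inter = 0
    union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            if pa and pb:
                inter += 1
            if pa or pb:
                union += 1
    # Two empty masks are treated as identical (an assumption of this sketch).
    return inter / union if union else 1.0


def has_single_grounding(groundings: Dict[str, Mask], threshold: float = 0.9) -> bool:
    """True if every pair of answer groundings overlaps with IoU >= threshold.

    The threshold is a hypothetical value chosen for illustration.
    """
    masks = list(groundings.values())
    return all(mask_iou(a, b) >= threshold for a, b in combinations(masks, 2))


# Hypothetical example: two answers pointing at the same region.
same_region = {
    "mug": [[1, 1, 0], [1, 1, 0]],
    "cup": [[1, 1, 0], [1, 1, 0]],
}

# Hypothetical example: two answers pointing at disjoint regions.
different_regions = {
    "coffee": [[1, 1, 0], [0, 0, 0]],
    "table": [[0, 0, 0], [0, 1, 1]],
}

print(has_single_grounding(same_region))
print(has_single_grounding(different_regions))
```

In this sketch the first example counts as a single grounding and the second does not. A learned model for the first challenge would predict this label from the image and question alone, and a model for the second challenge would output the masks themselves.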