Abstract: In visual question answering (VQA), a machine must answer a question given an associated image. Recently, accessibility researchers have explored whether VQA can be deployed in a real-world setting where users with visual impairments learn about their environment by capturing their visual surroundings and asking questions. However, most of the existing benchmarking datasets for VQA focus on machine "understanding" and it remains unclear how progress on those datasets corresponds to improvements in this real-world use case. We aim to answer this question by evaluating discrepancies between machine "understanding" datasets (VQA-v2) and accessibility datasets (VizWiz) by evaluating a variety of VQA models. Based on our findings, we discuss opportunities and challenges in VQA for accessibility and suggest directions for future work.

What problem does this paper attempt to address?

This paper examines how far progress on Visual Question Answering (VQA) benchmarks built for machine "understanding" carries over to the practical use case of assisting people with visual impairments. Specifically, the researchers evaluated existing VQA model architectures on a machine-"understanding" dataset (VQA-v2) and an accessibility dataset (VizWiz) to reveal the gap between the two, and discussed the challenges and opportunities of applying machine-"understanding" VQA models to accessibility. By comparing model performance on these two datasets, the paper aims to provide directions for future research, especially on how to improve VQA systems so that they better serve visually impaired users.

### Main research contents:

1. **Performance evaluation**: The researchers evaluated seven different VQA models on the VQA-v2 and VizWiz datasets to compare how each model performs on the two tasks.
2. **Error analysis**: Through quantitative and qualitative analysis of errors on the VizWiz dataset, the researchers identified the question types that models handle poorly, such as questions requiring text recognition, questions requiring color recognition, and ambiguous questions.
3. **Challenge classification**: The researchers used two sets of metadata annotations to classify the questions in the VizWiz validation set:
   - visual-skill challenges: object recognition, text recognition, color recognition, and counting;
   - image-question challenges: answer granularity, ambiguous questions, synonyms, missing answers, low-quality images, invalid questions, questions requiring expertise, subjective questions, and insufficient answers.

### Main findings:

1. **Performance improvement**: As model architectures improve, gains on VQA-v2 are accompanied by gains on VizWiz, but the gap between the two remains significant.
2. **Over-fitting effect**: When dataset size is controlled, the improvement on VQA-v2-sm is much greater than that on VizWiz, which points to overfitting to the machine-"understanding" benchmark.
3. **Error analysis**: The models perform poorly on questions that require text recognition and on ambiguous questions. In addition, many errors stem from problems with the evaluation metric and the annotated data rather than from the models themselves.
4. **Improvement directions**: Future research should focus more on data collection and model design for text recognition and ambiguous questions, and should develop more robust evaluation protocols that capture performance improvements more accurately.

### Conclusion:

Although progress on machine-"understanding" VQA has improved accessibility VQA to some extent, the gap between the two remains significant. If the research community continues to optimize only on datasets such as VQA-v2, progress on human-centered applications of this technology may stall or even regress. Future research should therefore pay more attention to the needs of practical application scenarios, especially in data collection and model design.
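The size-controlled comparison and the per-challenge error breakdown described above can be sketched roughly as follows. This is an illustrative sketch only, not the authors' code: the example layout (`answers`, `categories` keys) and the helper names `vqa_accuracy`, `subsample`, and `accuracy_by_category` are assumptions made for this example, and the scoring function is the simplified form of the commonly used VQA accuracy (full credit when at least three annotators gave the predicted answer).

```python
import random
from collections import defaultdict


def vqa_accuracy(prediction, human_answers):
    """Simplified VQA accuracy: min(#annotators who gave this answer / 3, 1)."""
    pred = prediction.strip().lower()
    matches = sum(1 for a in human_answers if a.strip().lower() == pred)
    return min(matches / 3.0, 1.0)


def subsample(examples, size, seed=0):
    """Randomly shrink a dataset so two datasets can be compared at equal size.

    Hypothetical stand-in for building a size-matched subset such as VQA-v2-sm.
    """
    rng = random.Random(seed)
    return rng.sample(examples, min(size, len(examples)))


def accuracy_by_category(examples, predictions):
    """Average accuracy per challenge category (assumed 'categories' key)."""
    scores = defaultdict(list)
    for example, prediction in zip(examples, predictions):
        score = vqa_accuracy(prediction, example["answers"])
        for category in example.get("categories", ["all"]):
            scores[category].append(score)
    return {cat: sum(vals) / len(vals) for cat, vals in scores.items()}


# Toy data, invented purely to show the call pattern.
toy_examples = [
    {"answers": ["red", "red", "red", "maroon"], "categories": ["color recognition"]},
    {"answers": ["soup", "can", "unanswerable"], "categories": ["text recognition"]},
]
toy_predictions = ["red", "beans"]
per_category = accuracy_by_category(toy_examples, toy_predictions)
```

Grouping scores by category in this way makes it possible to see which challenge types account for most of the lost accuracy, rather than relying on a single aggregate number per dataset.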

What's Different between Visual Question Answering for Machine "Understanding" Versus for Accessibility?
