Visual Robustness Benchmark for Visual Question Answering (VQA)

Md Farhan Ishmam, Ishmam Tashdeed, Talukder Asir Saadat, Md Hamjajul Ashmafee, Abu Raihan Mostofa Kamal, Md. Azam Hossain
2024-10-29
Abstract: Can Visual Question Answering (VQA) systems perform just as well when deployed in the real world? Or are they susceptible to realistic corruption effects, e.g., image blur, which can be detrimental in sensitive applications, such as medical VQA? While linguistic or textual robustness has been thoroughly explored in the VQA literature, there has yet to be any significant work on the visual robustness of VQA models. We propose the first large-scale benchmark comprising 213,000 augmented images, challenging the visual robustness of multiple VQA models and assessing the strength of realistic visual corruptions. Additionally, we have designed several robustness evaluation metrics that can be aggregated into a unified metric and tailored to fit a variety of use cases. Our experiments reveal several insights into the relationships between model size, performance, and robustness with the visual corruptions. Our benchmark highlights the need for a balanced approach in model development that considers model performance without compromising the robustness.
Computer Vision and Pattern Recognition
### What problem does this paper attempt to address?

This paper addresses the **visual robustness** of visual question answering (VQA) systems deployed in the real world. Specifically, it focuses on the following points:

1. **Real-world performance of VQA systems**: Existing VQA systems perform well under ideal conditions, but in the real world they may be affected by visual corruptions such as image blur and brightness changes, which degrade performance. In sensitive application areas such as medical VQA, this degradation can be seriously harmful.
2. **Evaluation of visual robustness**: Although textual robustness has been widely studied in the VQA field, there has been little work on visual robustness, and large-scale benchmarks and evaluation metrics are lacking. The paper therefore proposes the first large-scale visual robustness benchmark, comprising 213,000 augmented images, to evaluate the robustness of multiple VQA models to realistic visual corruptions.
3. **Robustness evaluation metrics**: To evaluate visual robustness comprehensively, the paper designs five new evaluation metrics and aggregates them into a unified measure, the **Visual Robustness Error (VRE)**. These metrics can be tailored to specific application scenarios.
4. **Balance between model performance and robustness**: Through experiments, the paper examines the relationships between model size, performance, and robustness, and emphasizes the need to balance performance and robustness during model development rather than pursuing high accuracy alone.

### Summary

The core aim of the paper is to evaluate how VQA systems perform under realistic visual corruptions. It proposes a benchmark and a set of robustness metrics that fill a gap in existing research and provide a reference point for future work.
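The idea of combining several robustness metrics into one score that can be weighted per use case might look roughly like the sketch below. It is only an illustration under stated assumptions: the function names, the relative accuracy-drop measure, and the weighted-mean aggregation are hypothetical and are not the paper's definitions of its metrics or of VRE.

```python
from typing import Dict, Optional


def relative_accuracy_drop(clean_acc: float, corrupted_acc: float) -> float:
    """Fraction of clean accuracy lost under a corruption (hypothetical measure)."""
    if clean_acc <= 0:
        raise ValueError("clean_acc must be positive")
    # Clamp at zero so a corruption that happens to help is not rewarded.
    return max(0.0, (clean_acc - corrupted_acc) / clean_acc)


def aggregate_robustness(
    metrics: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted mean of per-metric error scores (assumed to be on a common scale).

    This is an assumed aggregation scheme for illustration only,
    not the paper's VRE formula.
    """
    if not metrics:
        raise ValueError("metrics must not be empty")
    if weights is None:
        # Default: every metric counts equally.
        weights = {name: 1.0 for name in metrics}
    total = sum(weights[name] for name in metrics)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[name] * value for name, value in metrics.items()) / total


if __name__ == "__main__":
    # Made-up accuracies, purely to show the call pattern.
    scores = {
        "blur": relative_accuracy_drop(0.80, 0.60),
        "brightness": relative_accuracy_drop(0.80, 0.72),
    }
    # A use case that cares more about blur could up-weight it.
    print(aggregate_robustness(scores, weights={"blur": 2.0, "brightness": 1.0}))
```

Passing different weights is one simple way to tailor a single aggregate score to an application, for example by emphasizing the corruption types most likely to occur in a given deployment.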