Abstract: Vision Language Models (VLMs) extend the remarkable capabilities of text-only large language models and vision-only models, and are able to learn from and process multi-modal vision-text input. While modern VLMs perform well on a number of standard image classification and image-text matching tasks, they still struggle with several crucial vision-language (VL) reasoning abilities, such as counting and spatial reasoning. Moreover, although they can be very brittle to small variations in instructions and/or evaluation protocols, existing benchmarks fail to evaluate their robustness (or rather the lack of it). To couple challenging VL scenarios with comprehensive robustness evaluation, we introduce DARE, Diverse Visual Question Answering with Robustness Evaluation, a carefully created and curated multiple-choice VQA benchmark. DARE evaluates VLM performance on five diverse categories and includes four robustness-oriented evaluations based on variations of the prompts, the subsets of answer options, the output format, and the number of correct answers. Among a spectrum of other findings, we report that state-of-the-art VLMs still struggle with questions in most categories and are unable to consistently deliver their peak performance across the tested robustness evaluations. The worst-case performance across the subsets of options is up to 34% below the performance in the standard case. The robustness of open-source VLMs such as LLaVA 1.6 and Idefics2 cannot match that of closed-source models such as GPT-4 and Gemini, but even the latter remain very brittle to different variations.
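
To make the notion of worst-case performance across variations concrete, the following is a minimal sketch of one way such a gap could be computed. It is not the benchmark's own evaluation code. It assumes that each variation (for example, one subset of answer options) has already been scored into per-question correctness flags, and that "worst case" means the lowest aggregate accuracy over variations. The names `accuracy` and `robustness_report` are hypothetical and introduced only for illustration.

```python
from typing import Dict, List


def accuracy(correct: List[bool]) -> float:
    """Fraction of questions answered correctly."""
    if not correct:
        raise ValueError("need at least one scored question")
    return sum(correct) / len(correct)


def robustness_report(
    standard: List[bool],
    variations: Dict[str, List[bool]],
) -> Dict[str, float]:
    """Compare standard-case accuracy with the worst variation.

    Assumption (not taken from the paper): the worst case is the minimum
    aggregate accuracy over all variations, and the gap is reported as an
    absolute difference in accuracy.
    """
    if not variations:
        raise ValueError("need at least one variation")
    standard_acc = accuracy(standard)
    per_variation = {name: accuracy(flags) for name, flags in variations.items()}
    worst_acc = min(per_variation.values())
    return {
        "standard": standard_acc,
        "worst_case": worst_acc,
        "gap": standard_acc - worst_acc,
    }


# Toy usage with made-up correctness flags for four questions.
report = robustness_report(
    standard=[True, True, True, False],
    variations={
        "subset_a": [True, True, False, False],
        "subset_b": [True, True, True, False],
    },
)
```

A stricter alternative would count a question as correct only if the model answers it correctly under every variation. The sketch does not assume which of the two definitions the benchmark uses.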

Variational Open-Domain Question Answering

Overcoming Language Priors in VQA via Decomposed Linguistic Representations

Open-Ended Medical Visual Question Answering Through Prefix Tuning of Language Models

End-to-End Training of Neural Retrievers for Open-Domain Question Answering

Answerability in Retrieval-Augmented Open-Domain Question Answering

An Empirical Study on the Language Modal in Visual Question Answering

SS-BERT: A Semantic Information Selecting Approach for Open-Domain Question Answering

Improving Retrieval Augmented Open-Domain Question-Answering with Vectorized Contexts

Open-Ended Visual Question Answering by Multi-Modal Domain Adaptation

Inverse Visual Question Answering: A New Benchmark and VQA Diagnosis Tool

L2R-QA: An Open-Domain Question Answering Framework

Reimagining Retrieval Augmented Language Models for Answering Queries

Visual Explanation for Open-Domain Question Answering with BERT

DARE: Diverse Visual Question Answering with Robustness Evaluation

A Bi-level representation learning model for medical visual question answering

Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy

Variational Reasoning for Question Answering With Knowledge Graph

Unraveling and Mitigating Retriever Inconsistencies in Retrieval-Augmented Large Language Models

Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models

You Only Need One Model for Open-domain Question Answering

Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering