Abstract: Visual Question Answering (VQA) has been a popular task that combines vision and language, with numerous relevant implementations in the literature. Even though there are some attempts to address explainability and robustness issues in VQA models, very few of them employ counterfactuals as a means of probing such challenges in a model-agnostic way. In this work, we propose a systematic method for explaining the behavior and investigating the robustness of VQA models through counterfactual perturbations. To this end, we exploit structured knowledge bases to perform deterministic, optimal and controllable word-level replacements targeting the linguistic modality, and we then evaluate the model's response against such counterfactual inputs. Finally, we qualitatively extract local and global explanations based on counterfactual responses, which ultimately prove insightful for interpreting VQA model behaviors. By performing a variety of perturbation types, targeting different parts of speech of the input question, we gain insights into the reasoning of the model by comparing its responses under different adversarial circumstances. Overall, we reveal possible biases in the decision-making process of the model, as well as expected and unexpected patterns, which impact its performance quantitatively and qualitatively, as indicated by our analysis.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper aims to address the issues of interpretability and robustness in Visual Question Answering (VQA) models. Although existing VQA models have made significant progress in performance, their black-box nature leads to a lack of interpretability and fairness, posing risks in critical decision-making. Specifically, the paper explores the behavior of VQA models by introducing counterfactual perturbations and reveals the reasoning process of the models under different adversarial conditions.

### Main Contributions

1. **Designing Counterfactual Inputs**: The paper proposes a knowledge graph-based method to generate counterfactual inputs by structurally replacing words at the lexical level in the questions. This method is model-agnostic and can be applied to any VQA model.
2. **Obtaining Local Explanations**: By analyzing the responses to counterfactual inputs, the paper extracts changes in model behavior under specific perturbations to obtain local explanations.
3. **Extracting Global Explanations**: By summarizing the local behaviors across all questions, the paper extracts global explanations, revealing the overall behavior patterns and potential weaknesses of the model.

### Method Overview

1. **Dataset and Model Selection**: The paper uses the Visual Genome (VG) and VQA-v2 datasets and selects ViLT as the pre-trained VQA model.
2. **Counterfactual Perturbations**: Guided by external knowledge sources (such as WordNet and color correlation hierarchies), the paper performs lexical-level replacements targeting nouns, verbs, and adjectives.
   - **Synonym Replacement**: For example, replacing "talk" with "speak" or "small" with "minuscule."
   - **Hypernym/Hyponym Replacement**: For example, replacing "dog" with its hypernym "canine" or hyponym "labrador."
   - **Sibling Word Replacement**: For example, replacing "carrot" with "radish."
   - **Color Replacement**: Including maximum color replacement (e.g., replacing "violet" with "deepskyblue") and minimum color replacement (e.g., replacing "violet" with "orchid").
   - **Noun Deletion**: Randomly deleting a noun from the question.
3. **Evaluation Method**: The robustness of the model is evaluated by comparing the accuracy of the original questions with the counterfactual questions.

### Experimental Results

1. **Accuracy Changes**: In all experiments, the accuracy dropped by approximately 15-20% or more from the original questions to the counterfactual questions.
2. **Local Explanations**: By analyzing responses to specific perturbations, the paper found that the model exhibited biases in certain colors (e.g., "gray" and "silver") while performing more reasonably in other colors (e.g., "green" and "red").
3. **Global Explanations**: By summarizing local behaviors, the paper extracted global rules, revealing the model's strengths and weaknesses in handling different concepts.

### Conclusion

By introducing counterfactual perturbations, the paper successfully reveals the behavior changes of VQA models under different adversarial conditions, providing a deeper understanding of the model's decision-making process. These findings help improve the interpretability and robustness of VQA models, making them more reliable in practical applications.
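
### Illustrative Sketch

The perturbation and evaluation steps summarized above can be illustrated with a minimal sketch. It is not the paper's implementation, and everything in it is an assumption made for illustration. The `HYPERNYM` dictionary is a hand-written toy stand-in for WordNet. Color distance is taken to be squared Euclidean distance between CSS RGB values over a five-color toy palette, whereas the summary only mentions "color correlation hierarchies". `stub_model` is a hypothetical, deliberately brittle stand-in for a real VQA model such as ViLT. The original answer is assumed to remain the ground truth after a hypernym replacement. All function names are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Toy child -> parent map standing in for a lexical knowledge base (assumption).
HYPERNYM: Dict[str, str] = {
    "dog": "canine",
    "labrador": "dog",
    "carrot": "root_vegetable",
    "radish": "root_vegetable",
}

# Toy palette with standard CSS RGB values (the palette choice is an assumption).
COLORS: Dict[str, Tuple[int, int, int]] = {
    "violet": (238, 130, 238),
    "orchid": (218, 112, 214),
    "deepskyblue": (0, 191, 255),
    "gray": (128, 128, 128),
    "silver": (192, 192, 192),
}

Rule = Callable[[str], Optional[str]]


def hypernym(word: str) -> Optional[str]:
    """Return the parent concept of `word`, or None if it is unknown."""
    return HYPERNYM.get(word)


def hyponyms(word: str) -> List[str]:
    """Return all direct children of `word` in the toy hierarchy."""
    return sorted(w for w, parent in HYPERNYM.items() if parent == word)


def siblings(word: str) -> List[str]:
    """Return other words that share the same parent as `word`."""
    parent = HYPERNYM.get(word)
    if parent is None:
        return []
    return sorted(w for w, p in HYPERNYM.items() if p == parent and w != word)


def color_swap(color: str, farthest: bool) -> Optional[str]:
    """Return the farthest (maximum) or nearest (minimum) other palette color."""
    if color not in COLORS:
        return None

    def dist(other: str) -> int:
        return sum((a - b) ** 2 for a, b in zip(COLORS[color], COLORS[other]))

    candidates = [c for c in COLORS if c != color]
    return max(candidates, key=dist) if farthest else min(candidates, key=dist)


def perturb(question: str, rule: Rule) -> Optional[str]:
    """Replace the first token the rule applies to; None if no token matches.

    Whitespace tokenization is a simplification; punctuation is not handled.
    """
    tokens = question.split()
    for i, tok in enumerate(tokens):
        new = rule(tok.lower())
        if new:
            return " ".join(tokens[:i] + [new] + tokens[i + 1:])
    return None


def accuracy(model: Callable[[str], str], samples: List[Tuple[str, str]]) -> float:
    """Fraction of (question, answer) pairs the model answers correctly."""
    return sum(model(q) == a for q, a in samples) / len(samples)


def stub_model(question: str) -> str:
    # Hypothetical brittle "model": says yes only for the literal word "dog".
    return "yes" if "dog" in question.split() else "no"


samples = [("is there a dog", "yes"), ("is there a carrot", "no")]
# Assumption: a hypernym swap preserves the answer, so labels are reused.
counterfactuals = [(perturb(q, hypernym) or q, a) for q, a in samples]

original_acc = accuracy(stub_model, samples)
counterfactual_acc = accuracy(stub_model, counterfactuals)
```

Comparing `original_acc` with `counterfactual_acc` mirrors the summary's evaluation step: a gap between the two indicates that the model depends on surface wording rather than on the underlying concept. A real experiment would swap the toy dictionary for an actual knowledge base and the stub for the VQA model under study.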

Knowledge-Based Counterfactual Queries for Visual Question Answering

On the Flip Side: Identifying Counterexamples in Visual Question Answering

Cross-modal Knowledge Reasoning for Knowledge-based Visual Question Answering

Knowledge-Augmented Visual Question Answering With Natural Language Explanation

Counterfactual Samples Synthesizing and Training for Robust Visual Question Answering

COIN: Counterfactual Image Generation for VQA Interpretation

Perceptual Visual Reasoning with Knowledge Propagation

Knowledge-Enhanced Visual Question Answering with Multi-modal Joint Guidance

Unveiling Cross Modality Bias in Visual Question Answering: A Causal View with Possible Worlds VQA

Find The Gap: Knowledge Base Reasoning For Visual Question Answering

Multitask Learning for Visual Question Answering

Counterfactual Samples Synthesizing for Robust Visual Question Answering

Query and Attention Augmentation for Knowledge-Based Explainable Reasoning

Disentangling Knowledge-based and Visual Reasoning by Question Decomposition in KB-VQA

Knowledge Condensation and Reasoning for Knowledge-based VQA

Proposing Plausible Answers for Open-ended Visual Question Answering

A Thousand Words Are Worth More Than a Picture: Natural Language-Centric Outside-Knowledge Visual Question Answering

AI-VQA: Visual Question Answering based on Agent Interaction with Interpretability

Precision Empowers, Excess Distracts: Visual Question Answering With Dynamically Infused Knowledge In Language Models

Boosting Visual Question Answering with Context-aware Knowledge Aggregation

Dynamic Key-value Memory Enhanced Multi-step Graph Reasoning for Knowledge-based Visual Question Answering