Abstract: Visual question answering with natural language explanation (VQA-NLE) is a challenging task that requires a model not only to generate an accurate answer but also to provide an explanation that justifies the underlying decision-making process. The task is accomplished by generating natural language sentences conditioned on the given question-image pair. However, existing methods often struggle to keep answers and explanations consistent with each other because they disregard the interaction between the two. Existing methods also overlook the potential benefit of incorporating additional knowledge, which hinders their ability to bridge the semantic gap between questions and images and leads to less accurate explanations. In this paper, we present a novel approach, the knowledge-based iterative consensus VQA-NLE (KICNLE) model, to address these limitations. To maintain consistency, our model incorporates an iterative consensus generator that adopts a multi-iteration generative method, refining the answer and the explanation over multiple iterations within each generation. In each iteration, the current answer is used to generate an explanation, which in turn guides the generation of a new answer. Additionally, a knowledge retrieval module is introduced to provide potentially valid candidate knowledge that guides the generation process, helps bridge the gap between questions and images, and enables the production of high-quality answer-explanation pairs. Extensive experiments conducted on three different datasets demonstrate the superiority of our proposed KICNLE model over competing state-of-the-art approaches. Our code is available at https://github.com/Gary-code/KICNLE.
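
The alternating loop described in the abstract (answer, then explanation, then a new answer) can be illustrated with a minimal control-flow sketch. This is not the KICNLE implementation: the function names, the argument layout, the fixed iteration count, and the toy stand-in generators are all assumptions made for illustration, and the real model would use learned multimodal generators and retrieved knowledge.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical generator signature: (question, image_features, knowledge, context) -> text.
Generator = Callable[[str, Sequence[float], Sequence[str], str], str]


def iterative_consensus(
    question: str,
    image_features: Sequence[float],
    knowledge: Sequence[str],
    generate_answer: Generator,
    generate_explanation: Generator,
    num_iterations: int = 2,
) -> Tuple[str, str]:
    """Alternate between explaining the current answer and re-answering.

    Assumption: a fixed number of iterations; the abstract does not state
    a stopping rule.
    """
    # Initial answer, produced without any explanation as context.
    answer = generate_answer(question, image_features, knowledge, "")
    explanation = ""
    for _ in range(num_iterations):
        # The current answer is used to generate an explanation ...
        explanation = generate_explanation(question, image_features, knowledge, answer)
        # ... which in turn guides the generation of a new answer.
        answer = generate_answer(question, image_features, knowledge, explanation)
    return answer, explanation


# Toy stand-ins (purely illustrative, not learned models).
def toy_answer(question, image_features, knowledge, explanation):
    # Answers "red" only once the explanation mentions it.
    return "red" if "red" in explanation else "unknown"


def toy_explanation(question, image_features, knowledge, answer):
    # Builds an explanation from the current answer and the first knowledge item.
    return f"the answer is {answer} because {knowledge[0]}"


answer, explanation = iterative_consensus(
    question="What color is the apple?",
    image_features=[0.0],
    knowledge=["apple is red"],
    generate_answer=toy_answer,
    generate_explanation=toy_explanation,
)
print(answer, "|", explanation)
```

In this toy run the first answer is uninformed, the first explanation surfaces the retrieved knowledge, and the second pass brings the answer and the explanation into agreement, which is the consistency effect the abstract attributes to iterating.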

Dual Learning for Visual Question Generation

Dual Path Multi-Modal High-Order Features for Textual Content Based Visual Question Answering

Simple and Effective Visual Question Answering in a Single Modality

Question Answering and Question Generation as Dual Tasks

Learning to Generate Visual Questions with Noisy Supervision

Question-Guided Semantic Dual-Graph Visual Reasoning with Novel Answers

Multi-Question Learning for Visual Question Answering

Multitask Learning for Visual Question Answering

DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering

Visual Question Generation Under Multi-granularity Cross-Modal Interaction

Exploring Diverse Methods in Visual Question Answering

Learning to Generate Question by Asking Question: A Primal-Dual Approach with Uncommon Word Generation

The VQA-Machine: Learning How to Use Existing Vision Algorithms to Answer New Questions

LCV2: A Universal Pretraining-Free Framework for Grounded Visual Question Answering

Ask Questions with Double Hints: Visual Question Generation with Answer-awareness and Region-reference

LCV2: An Efficient Pretraining-Free Framework for Grounded Visual Question Answering

AI-VQA: Visual Question Answering based on Agent Interaction with Interpretability

Joint Learning of Object Graph and Relation Graph for Visual Question Answering

Knowledge-Augmented Visual Question Answering With Natural Language Explanation

Knowledge-Enhanced Visual Question Answering with Multi-modal Joint Guidance

DU-VLG: Unifying Vision-and-Language Generation via Dual Sequence-to-Sequence Pre-training