Abstract: While large visual-language models (LVLMs) have shown promising results on traditional visual question answering (VQA) benchmarks, they still struggle with complex VQA problems that require diverse world knowledge. Motivated by research on retrieval-augmented generation in natural language processing, we use Dense Passage Retrieval (DPR) to retrieve related knowledge that helps the model answer questions. However, DPR retrieves in natural language space, which may not capture the image information comprehensively. As a result, the retrieved knowledge does not always help answer the question, which degrades the performance of the overall system. To address this issue, we propose a novel framework that leverages the visual-language model to select the key knowledge retrieved by DPR and answer questions. The framework consists of two modules, a Selector and an Answerer, both initialized from the LVLM and parameter-efficiently finetuned by self-bootstrapping: (1) find key knowledge in the retrieved knowledge documents using the Selector, and use it to finetune the Answerer to predict answers; (2) obtain pseudo-labels for the key knowledge documents based on the predictions of the Answerer and weak supervision labels, and use them to finetune the Selector to select key knowledge; (3) repeat. Our framework significantly enhances the performance of the baseline on the challenging open-domain knowledge-based VQA benchmark OK-VQA, achieving a state-of-the-art accuracy of 62.83%. Our code is publicly available at <a class="link-external link-https" href="https://github.com/haodongze/Self-KSel-QAns" rel="external noopener nofollow">this https URL</a>.
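
The alternating Selector/Answerer training loop described in the abstract can be illustrated with a minimal toy sketch. This is not the authors' implementation: `ToySelector`, `ToyAnswerer`, `weak_label`, and `pseudo_labels` are hypothetical stand-ins, the weak label is assumed to be a simple "document mentions the answer" check, and the pseudo-label is assumed to require both a correct Answerer prediction and a positive weak label. In the actual framework both modules are LVLMs finetuned parameter-efficiently, which this sketch replaces with trivial lookups.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    question: str
    docs: List[str]  # knowledge documents already retrieved (e.g. by DPR)
    answer: str      # ground-truth answer


class ToySelector:
    """Hypothetical stand-in for the Selector: ranks docs by a learned score."""

    def __init__(self) -> None:
        self.score = {}  # doc -> score learned from pseudo-labels

    def select(self, question: str, docs: List[str], top_k: int) -> List[str]:
        ranked = sorted(docs, key=lambda d: self.score.get(d, 0.0), reverse=True)
        return ranked[:top_k]

    def finetune(self, batch) -> None:
        for _question, docs, labels in batch:
            for doc, label in zip(docs, labels):
                self.score[doc] = float(label)


class ToyAnswerer:
    """Hypothetical stand-in for the Answerer: returns a known answer found in the docs."""

    def __init__(self) -> None:
        self.vocab: List[str] = []  # answers seen during finetuning

    def predict(self, question: str, docs: List[str]) -> str:
        text = " ".join(docs).lower()
        for candidate in self.vocab:
            if candidate in text:
                return candidate
        return ""

    def finetune(self, batch) -> None:
        for _question, _docs, answer in batch:
            if answer not in self.vocab:
                self.vocab.append(answer)


def weak_label(doc: str, answer: str) -> bool:
    # Assumed weak supervision: the document mentions the answer string.
    return answer.lower() in doc.lower()


def pseudo_labels(ex: Example, answerer: ToyAnswerer) -> List[int]:
    # Assumed rule: a doc is "key" if the Answerer is correct when given only
    # that doc AND the weak label is positive.
    labels = []
    for doc in ex.docs:
        correct = answerer.predict(ex.question, [doc]) == ex.answer
        labels.append(int(correct and weak_label(doc, ex.answer)))
    return labels


def self_bootstrap(data: List[Example], selector, answerer, rounds: int = 2, top_k: int = 1):
    for _ in range(rounds):
        # (1) Selector picks key docs; Answerer is finetuned on them.
        answerer.finetune(
            [(ex.question, selector.select(ex.question, ex.docs, top_k), ex.answer) for ex in data]
        )
        # (2) Pseudo-labels from Answerer predictions + weak labels; finetune Selector.
        selector.finetune(
            [(ex.question, ex.docs, pseudo_labels(ex, answerer)) for ex in data]
        )
        # (3) Repeat.
    return selector, answerer


if __name__ == "__main__":
    data = [Example("What sport is shown?",
                    ["The weather is nice.", "Tennis is played with a racket."],
                    "tennis")]
    selector, answerer = self_bootstrap(data, ToySelector(), ToyAnswerer())
    print(selector.select(data[0].question, data[0].docs, top_k=1))
```

In this toy setting the Selector starts with no preference among documents and only learns to rank the informative one first after the Answerer's predictions have produced pseudo-labels, which mirrors the mutual dependence between the two modules.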

Enhancing BERT-Based Visual Question Answering through Keyword-Driven Sentence Selection

Simple and Effective Visual Question Answering in a Single Modality

Dual Path Multi-Modal High-Order Features for Textual Content Based Visual Question Answering

Learning from Lexical Perturbations for Consistent Visual Question Answering

Multi-Page Document Visual Question Answering using Self-Attention Scoring Mechanism

Visual Explanation for Open-Domain Question Answering with BERT

Knowledge-Based Counterfactual Queries for Visual Question Answering

Document Visual Question Answering Challenge 2020

Question-Guided Semantic Dual-Graph Visual Reasoning with Novel Answers

Check It Again: Progressive Visual Question Answering via Visual Entailment

Visual Text Matters: Improving Text-KVQA with Visual Text Entity Knowledge-aware Large Multimodal Assistant

Multitask Fine Tuning on Pretrained Language Model for Retrieval-Based Question Answering in Automotive Domain

Task-driven Visual Saliency and Attention-based Visual Question Answering

Precision Empowers, Excess Distracts: Visual Question Answering With Dynamically Infused Knowledge In Language Models

Enhancing Document-Based Question Answering via Interaction Between Question Words and POS Tags

Cross-modal Retrieval for Knowledge-based Visual Question Answering

Video Question Answering Using CLIP-Guided Visual-Text Attention

NLP at UC Santa Cruz at SemEval-2024 Task 5: Legal Answer Validation using Few-Shot Multi-Choice QA

Understanding Video Scenes through Text: Insights from Text-based Video Question Answering

Self-Bootstrapped Visual-Language Model for Knowledge Selection and Question Answering

VTQA: Visual Text Question Answering via Entity Alignment and Cross-Media Reasoning