Abstract: Knowledge-based Visual Question Answering about Named Entities is a challenging task that requires retrieving information from a multimodal Knowledge Base. Named entities have diverse visual representations and are therefore difficult to recognize. We argue that cross-modal retrieval may help bridge the semantic gap between an entity and its depictions, and is foremost complementary with mono-modal retrieval. We provide empirical evidence through experiments with a multimodal dual encoder, namely CLIP, on the recent ViQuAE, InfoSeek, and Encyclopedic-VQA datasets. Additionally, we study three different strategies to fine-tune such a model: mono-modal, cross-modal, or joint training. Our method, which combines mono- and cross-modal retrieval, is competitive with billion-parameter models on the three datasets, while being conceptually simpler and computationally cheaper.

What problem does this paper attempt to address?

The paper addresses entity retrieval for Knowledge-based Visual Question Answering about Named Entities (KVQAE). The researchers propose combining mono-modal and cross-modal retrieval to cope with the diverse visual depictions of a named entity. The core contributions of the paper include:

- **Problem Background**: KVQAE requires retrieving information from a multimodal knowledge base to answer questions about named entities. The same entity may have different visual representations, which makes it difficult to recognize.
- **Methodology**: The authors argue that cross-modal retrieval can help bridge the semantic gap between an entity and its depictions, and that it complements mono-modal retrieval. They support this with experiments on a multimodal dual encoder, namely CLIP, used for both kinds of retrieval. Such a model can score image-to-text (IqTp) and image-to-image (IqIp) similarity at the same time.
- **Experimental Design**: The researchers ran experiments on three datasets (ViQuAE, InfoSeek, and Encyclopedic-VQA) and studied three fine-tuning strategies (mono-modal, cross-modal, and joint training). Their method is competitive with billion-parameter models on all three datasets while being conceptually simpler and computationally cheaper.
- **Technical Details**: The authors define a scoring function that combines mono-modal and cross-modal similarities and implement it with CLIP to perform entity retrieval. They also discuss the impact of the different training strategies on retrieval performance and point out that models combining mono-modal and cross-modal training can achieve the best results.
In summary, this paper aims to improve entity retrieval in KVQAE by combining cross-modal retrieval with mono-modal retrieval, and thereby the performance of knowledge-based question answering systems that rely on visual context.
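
The scoring function mentioned under Technical Details can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear interpolation with weight `alpha`, the function names `combined_scores` and `retrieve`, and the toy embeddings are all hypothetical. In practice the vectors would come from CLIP's image and text encoders.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so that dot products are cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def combined_scores(query_image_emb, entity_image_embs, entity_text_embs, alpha=0.5):
    """Score every knowledge-base entity against one query image.

    query_image_emb:   (d,)   embedding of the question image
    entity_image_embs: (n, d) embeddings of the entities' reference images
    entity_text_embs:  (n, d) embeddings of the entities' names or descriptions
    alpha:             assumed interpolation weight (not taken from the paper)
    """
    q = l2_normalize(query_image_emb)
    mono = l2_normalize(entity_image_embs) @ q   # image-to-image similarity
    cross = l2_normalize(entity_text_embs) @ q   # image-to-text similarity
    return alpha * mono + (1.0 - alpha) * cross


def retrieve(query_image_emb, entity_image_embs, entity_text_embs, alpha=0.5, k=3):
    """Return the indices and scores of the k best-scoring entities."""
    scores = combined_scores(query_image_emb, entity_image_embs, entity_text_embs, alpha)
    top = np.argsort(-scores)[:k]
    return top, scores[top]


if __name__ == "__main__":
    # Random stand-ins for CLIP embeddings of a five-entity knowledge base.
    rng = np.random.default_rng(0)
    dim = 8
    kb_images = rng.normal(size=(5, dim))
    kb_texts = rng.normal(size=(5, dim))
    query = kb_images[2] + 0.01 * rng.normal(size=dim)
    indices, scores = retrieve(query, kb_images, kb_texts, alpha=0.5, k=3)
    print(indices, scores)
```

Setting `alpha=1.0` reduces the sketch to purely mono-modal retrieval and `alpha=0.0` to purely cross-modal retrieval, which makes it easy to compare the two signals on the same query.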

Cross-modal Retrieval for Knowledge-based Visual Question Answering

Dual Path Multi-Modal High-Order Features for Textual Content Based Visual Question Answering

Simple and Effective Visual Question Answering in a Single Modality

Cross-modal Knowledge Reasoning for Knowledge-based Visual Question Answering

Multimodal Reranking for Knowledge-Intensive Visual Question Answering

Improving and Diagnosing Knowledge-Based Visual Question Answering via Entity Enhanced Knowledge Injection

Multimodal Knowledge Triple Extraction Based on Representation Learning

Knowledge-Enhanced Visual Question Answering with Multi-modal Joint Guidance

Knowledge-Based Visual Question Answering Using Multi-Modal Semantic Graph

Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering

Language Guided Visual Question Answering: Elevate Your Multimodal Language Model Using Knowledge-Enriched Prompts

MuKEA: Multimodal Knowledge Extraction and Accumulation for Knowledge-based Visual Question Answering

Enhancing Multimodal Query Representation via Visual Dialogues for End-to-End Knowledge Retrieval

Knowledge-aware image understanding with multi-level visual representation enhancement for visual question answering

Find The Gap: Knowledge Base Reasoning For Visual Question Answering

Pre-Training Multi-Modal Dense Retrievers for Outside-Knowledge Visual Question Answering

Visual Question Answering in Remote Sensing with Cross-Attention and Multimodal Information Bottleneck

Question guided multimodal receptive field reasoning network for fact-based visual question answering

Multi-Modal Answer Validation for Knowledge-Based VQA

EchoSight: Advancing Visual-Language Models with Wiki Knowledge

Multi-Clue Reasoning with Memory Augmentation for Knowledge-based Visual Question Answering