Cross-modal Retrieval for Knowledge-based Visual Question Answering

Paul Lerner, Olivier Ferret, Camille Guinaudeau
2024-01-11
Abstract: Knowledge-based Visual Question Answering about Named Entities is a challenging task that requires retrieving information from a multimodal Knowledge Base. Named entities have diverse visual representations and are therefore difficult to recognize. We argue that cross-modal retrieval may help bridge the semantic gap between an entity and its depictions, and is foremost complementary with mono-modal retrieval. We provide empirical evidence through experiments with a multimodal dual encoder, namely CLIP, on the recent ViQuAE, InfoSeek, and Encyclopedic-VQA datasets. Additionally, we study three different strategies to fine-tune such a model: mono-modal, cross-modal, or joint training. Our method, which combines mono- and cross-modal retrieval, is competitive with billion-parameter models on the three datasets, while being conceptually simpler and computationally cheaper.
Computation and Language, Information Retrieval
What problem does this paper attempt to address?
The paper addresses entity retrieval in Knowledge-based Visual Question Answering about named Entities (KVQAE). The authors propose combining mono-modal and cross-modal retrieval to recognize entities whose visual depictions vary widely. The core contributions are:
- **Problem Background**: KVQAE requires retrieving information from a multimodal knowledge base to answer questions about named entities. The same entity can be depicted in many different ways, which makes it hard to recognize.
- **Methodology**: The authors argue that cross-modal retrieval helps bridge the semantic gap between an entity and its depictions, and that it complements mono-modal retrieval. They support this with experiments on a multimodal dual encoder, namely CLIP, used for both image-to-text (IqTp) and image-to-image (IqIp) retrieval.
- **Experimental Design**: Experiments cover three datasets (ViQuAE, InfoSeek, and Encyclopedic-VQA) and three fine-tuning strategies (mono-modal, cross-modal, and joint training). The method is competitive with billion-parameter models on all three datasets while being conceptually simpler and computationally cheaper.
- **Technical Details**: The authors define a scoring function that combines mono-modal and cross-modal similarities and implement it with CLIP. They also discuss how the training strategy affects retrieval performance and report that combining mono-modal and cross-modal training can achieve the best results.
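The scoring function mentioned under Technical Details can be illustrated with a minimal sketch. It is not the authors' exact formulation: the weighted sum, the weights `alpha_image` and `alpha_text`, the function names, and the two-dimensional toy vectors are all assumptions made for illustration. In practice the embeddings would come from CLIP's image and text encoders.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def combined_score(
    query_image: Vector,
    entity_image: Vector,
    entity_text: Vector,
    alpha_image: float = 0.5,  # hypothetical weight of the mono-modal term
    alpha_text: float = 0.5,   # hypothetical weight of the cross-modal term
) -> float:
    """Weighted sum of image-to-image (IqIp) and image-to-text (IqTp) similarity."""
    mono_modal = cosine(query_image, entity_image)   # IqIp
    cross_modal = cosine(query_image, entity_text)   # IqTp
    return alpha_image * mono_modal + alpha_text * cross_modal


def rank_entities(
    query_image: Vector,
    knowledge_base: Dict[str, Tuple[Vector, Vector]],
    alpha_image: float = 0.5,
    alpha_text: float = 0.5,
) -> List[Tuple[str, float]]:
    """Return (entity name, score) pairs sorted from best to worst match."""
    scored = [
        (name, combined_score(query_image, image, text, alpha_image, alpha_text))
        for name, (image, text) in knowledge_base.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy knowledge base: entity name -> (image embedding, text embedding).
toy_query = [1.0, 0.0]
toy_kb = {
    # Reference image looks nothing like the query, but the text embedding matches.
    "entity_a": ([0.0, 1.0], [1.0, 0.0]),
    # Reference image is somewhat similar to the query, but the text does not match.
    "entity_b": ([0.6, 0.8], [0.0, 1.0]),
}

combined_ranking = rank_entities(toy_query, toy_kb)
mono_only_ranking = rank_entities(toy_query, toy_kb, alpha_image=1.0, alpha_text=0.0)
```

With these toy vectors, mono-modal retrieval alone ranks `entity_b` first, while the combined score ranks `entity_a` first. This is the kind of complementarity the paper argues for: the cross-modal term can recover an entity whose reference image differs from the query image.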
In summary, the paper targets entity retrieval in KVQAE by combining mono-modal and cross-modal retrieval, with the goal of improving knowledge-based question answering grounded in visual context.