Abstract: Multimodal Question Answering (MMQA) has emerged as a challenging frontier at the intersection of natural language processing (NLP) and computer vision, demanding the integration of diverse modalities for effective comprehension and response. While pre-trained language models (PLMs) exhibit impressive performance across a range of NLP tasks, text-based approaches to MMQA remain a promising avenue for further research. Although recent research has explored text-based approaches for MMQA, the results have been unsatisfactory, which could be attributed to information loss during the knowledge transformation processes. In response, a novel three-stage framework named UniRaG is proposed for tackling MMQA, which encompasses unified knowledge representation, context retrieval, and answer generation. In the first stage, unified knowledge representation is achieved with LLaVA for image captioning and table linearization for tabular data, integrating visual and tabular information into a textual representation. For context retrieval, a cross-encoder trained on sequence classification is utilized to predict relevance scores for question-document pairs, and a top-k retrieval strategy is employed to retrieve the documents with the highest relevance scores as the contexts for answer generation. Finally, the answer generation stage uses a text-to-text PLM, Flan-T5-Base, which follows the encoder-decoder architecture with attention mechanisms. During this stage, uniform prefix conditioning is applied to the input text for enhanced adaptability and generalizability. Moreover, contextual diversity training is introduced to improve model robustness by including distractor documents as negative contexts during training.
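The three stages described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the "column is value" linearization template, the `question:` / `context:` prefix format, and all function names are assumptions chosen for illustration, and the token-overlap scorer is only a stand-in for the trained cross-encoder so that the example runs without downloading a model.

```python
import random
import re
from typing import Callable, List, Sequence


def linearize_table(header: Sequence[str], rows: Sequence[Sequence[str]]) -> str:
    """Flatten a table into text, row by row (assumed 'column is value' template)."""
    parts = []
    for i, row in enumerate(rows, start=1):
        cells = ", ".join(f"{h} is {v}" for h, v in zip(header, row))
        parts.append(f"row {i}: {cells}.")
    return " ".join(parts)


def overlap_score(question: str, document: str) -> float:
    """Stand-in relevance scorer: fraction of question tokens found in the document.

    The abstract uses a trained cross-encoder here; this is only a placeholder.
    """
    q = set(re.findall(r"\w+", question.lower()))
    d = set(re.findall(r"\w+", document.lower()))
    return len(q & d) / max(len(q), 1)


def top_k(question: str, documents: Sequence[str], k: int,
          score: Callable[[str, str], float] = overlap_score) -> List[str]:
    """Return the k documents with the highest relevance score for the question."""
    ranked = sorted(documents, key=lambda doc: score(question, doc), reverse=True)
    return ranked[:k]


def build_input(question: str, contexts: Sequence[str]) -> str:
    """Apply one uniform prefix layout to every example (assumed format)."""
    return f"question: {question} context: {' '.join(contexts)}"


def add_distractors(gold: Sequence[str], distractors: Sequence[str],
                    n: int, seed: int = 0) -> List[str]:
    """Training-time contexts: gold documents plus n sampled distractors, shuffled."""
    rng = random.Random(seed)
    contexts = list(gold) + rng.sample(list(distractors), n)
    rng.shuffle(contexts)
    return contexts


# Every modality is text by the time retrieval runs.
table_doc = linearize_table(["City", "Population"],
                            [["Paris", "2.1M"], ["Rome", "2.8M"]])
caption_doc = "A photo of a dog on a beach."  # e.g. an image caption
text_doc = "Rome is the capital of Italy."
question = "What is the population of Rome?"

contexts = top_k(question, [table_doc, caption_doc, text_doc], k=2)
model_input = build_input(question, contexts)
```

In a real system the `score` argument would be replaced by the cross-encoder's predicted relevance, and `model_input` would be tokenized and passed to the text-to-text generator.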
Experimental results on the MultimodalQA dataset demonstrate the superior performance of UniRaG, surpassing the existing state-of-the-art methods across all scenarios with 67.4% EM and 71.3% F1. Overall, UniRaG showcases robustness and reliability in MMQA, heralding significant advancements in multimodal comprehension and question answering research.
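For readers unfamiliar with the two reported metrics, the following sketch shows how exact match (EM) and token-level F1 are commonly computed for a single prediction. The normalization steps (lowercasing, stripping punctuation and English articles) follow the widely used SQuAD-style convention and are an assumption here; the abstract does not state which evaluation script was used.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    common = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Dataset-level scores such as those reported above are typically the mean of these per-example values, expressed as percentages.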

MulmQA: Multimodal Question Answering for Database Alarm

A Question-Answering System over Traditional Chinese Medicine

MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media Knowledge Extraction and Grounding

MCQA: Multimodal Co-attention Based Network for Question Answering

An Interactive Multi-modal Query Answering System with Retrieval-Augmented Large Language Models

Beyond Text QA: Multimedia Answer Generation by Harvesting Web Information

MST5 -- Multilingual Question Answering over Knowledge Graphs

Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering

HeteroQA: Learning towards Question-and-Answering through Multiple Information Sources via Heterogeneous Graph Modeling

MoCA: Incorporating Multi-stage Domain Pretraining and Cross-guided Multimodal Attention for Textbook Question Answering

ADMUS: A Progressive Question Answering Framework Adaptable to Multiple Knowledge Sources

Exploiting Abstract Meaning Representation for Open-Domain Question Answering

MuRAR: A Simple and Effective Multimodal Retrieval and Answer Refinement Framework for Multimodal Question Answering

ProMQA: Question Answering Dataset for Multimodal Procedural Activity Understanding

UniRaG: Unification, Retrieval, and Generation for Multimodal Question Answering With Pre-Trained Language Models

Question-Aware Memory Network for Multi-hop Question Answering in Human-Robot Interaction

Cross-Lingual Question Answering over Knowledge Base as Reading Comprehension

CT2C-QA: Multimodal Question Answering over Chinese Text, Table and Chart

MoCA: Incorporating domain pretraining and cross attention for textbook question answering

Bridging the Language Gap: Knowledge Injected Multilingual Question Answering

Chinese Knowledge Base Question Answering by Attention-Based Multi-Granularity Model