Multimodal Question Answering for Unified Information Extraction

Yuxuan Sun, Kai Zhang, Yu Su
2023-10-05
Abstract: Multimodal information extraction (MIE) aims to extract structured information from unstructured multimedia content. Due to the diversity of tasks and settings, most current MIE models are task-specific and data-intensive, which limits their generalization to real-world scenarios with diverse task requirements and limited labeled data. To address these issues, we propose a novel multimodal question answering (MQA) framework that unifies three MIE tasks by reformulating them into a unified span extraction and multi-choice QA pipeline. Extensive experiments on six datasets show that: 1) Our MQA framework consistently and significantly improves the performance of various off-the-shelf large multimodal models (LMMs) on MIE tasks, compared to vanilla prompting. 2) In the zero-shot setting, MQA outperforms previous state-of-the-art baselines by a large margin. In addition, the effectiveness of our framework transfers to the few-shot setting, enabling LMMs at the 10B-parameter scale to be competitive with, or outperform, much larger language models such as ChatGPT and GPT-4. Our MQA framework can serve as a general principle for utilizing LMMs to better solve MIE and potentially other downstream multimodal tasks.
Computation and Language
What problem does this paper attempt to address?
The paper addresses the poor generalization of Multimodal Information Extraction (MIE) models across diverse tasks and low-resource settings. Existing methods are largely task-specific and need large amounts of labeled data, so they transfer poorly to real-world scenarios. The paper proposes a Multimodal Question Answering (MQA) framework that unifies three MIE tasks, Multimodal Named Entity Recognition (MNER), Multimodal Relation Extraction (MRE), and Multimodal Event Detection (MED), by reformulating them into a unified pipeline of span extraction and multiple-choice question answering. Across six datasets, MQA consistently improves the performance of off-the-shelf large multimodal models (LMMs) on MIE tasks compared to vanilla prompting. In the zero-shot setting, it outperforms previous state-of-the-art baselines by a large margin. The gains carry over to the few-shot setting, where LMMs at the 10B-parameter scale become competitive with, or outperform, much larger language models such as ChatGPT and GPT-4. The authors present MQA as a general principle for applying LMMs to MIE and potentially to other downstream multimodal tasks.
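To make the reformulation concrete, here is a minimal Python sketch of how an entity-recognition task might be cast as span extraction plus multiple-choice QA. It is an illustration under stated assumptions, not the authors' implementation: the prompt wording, the two-stage ordering, the function names, and the `ask_model` callable (a stand-in for any LMM call, which in practice would also receive the image) are all hypothetical.

```python
import re
from string import ascii_uppercase
from typing import Callable, Dict, List, Tuple


def span_extraction_prompt(text: str) -> str:
    """Stage 1 (assumed wording): ask the model for candidate entity spans."""
    return (
        "List every entity span mentioned in the text, separated by commas.\n"
        f"Text: {text}\n"
        "Answer:"
    )


def multi_choice_prompt(question: str, options: List[str]) -> Tuple[str, Dict[str, str]]:
    """Stage 2 (assumed wording): turn a label set into lettered options."""
    letter_to_label = dict(zip(ascii_uppercase, options))
    option_lines = [f"({letter}) {label}" for letter, label in letter_to_label.items()]
    prompt = (
        f"{question}\nOptions:\n" + "\n".join(option_lines) + "\nAnswer with one letter:"
    )
    return prompt, letter_to_label


def parse_choice(answer: str, letter_to_label: Dict[str, str], default: str) -> str:
    """Naive parser: take the first standalone capital letter that is a valid option."""
    for match in re.finditer(r"\b([A-Z])\b", answer):
        if match.group(1) in letter_to_label:
            return letter_to_label[match.group(1)]
    return default


def extract_entities(
    text: str, labels: List[str], ask_model: Callable[[str], str]
) -> Dict[str, str]:
    """Run both stages; `ask_model` is a hypothetical prompt -> reply function."""
    raw_spans = ask_model(span_extraction_prompt(text))
    spans = [s.strip() for s in raw_spans.split(",") if s.strip()]
    typed = {}
    for span in spans:
        question = f'In the text "{text}", what type of entity is "{span}"?'
        prompt, mapping = multi_choice_prompt(question, labels)
        # Fall back to the last label (here assumed to be a catch-all class).
        typed[span] = parse_choice(ask_model(prompt), mapping, default=labels[-1])
    return typed
```

Because the model only has to pick a letter from a closed option list, the same two helpers could in principle serve relation extraction or event detection by swapping the question and the label set, which is the sense in which the formulation is "unified".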