A Unified Visual Prompt Tuning Framework with Mixture-of-Experts for Multimodal Information Extraction.

Bo Xu, Shizhou Huang, Ming Du, Hongya Wang, Hui Song, Yanghua Xiao, Xin Lin
DOI: https://doi.org/10.1007/978-3-031-30675-4_40
2023-01-01
Abstract: Multimodal information extraction has recently gained increasing attention in social media understanding, because it uses images as auxiliary information to resolve the ambiguity caused by insufficient semantic information in short texts. Despite their success, current methods do not take full advantage of the information provided by the diverse representations of images. To address this problem, we propose a novel unified visual prompt tuning framework with Mixture-of-Experts to fuse different types of image representations for multimodal information extraction. Extensive experiments on two different multimodal information extraction tasks demonstrate the effectiveness of our method. The source code is available at https://github.com/xubodhu/VisualPT-MoE.