Exploiting Visual Relation and Multi-Grained Knowledge for Multimodal Relation Extraction

Weiping Wang, Huan Liu, Qianru Shen, Hailun Lin, Zheng Lin
DOI: https://doi.org/10.1109/IJCNN60899.2024.10650770
2024-06-30
Abstract: Given a text and its related image, the multimodal relation extraction (MRE) task aims to predict the correct semantic relation between two entities in the input text. Although recent MRE approaches have made certain advances, they still suffer from two drawbacks. First, they ignore fine-grained visual relations between image objects, which can serve as important clues for inferring the correct relation. Second, they are unable to utilize useful external knowledge about the entities and the input sentence, leading to sub-optimal performance on examples that demand related background or commonsense knowledge. To alleviate the above limitations, we propose a novel method, named VRMK, which exploits both Visual Relations and Multi-grained Knowledge for the MRE task. Specifically, the input image-text pair is converted into a unified multimodal graph. Then, a relation-aware transformer is adopted to update node representations while explicitly encoding diverse relations among visual and textual nodes. Based on the input text, a large language model (LLM) is used to generate entity-level and sentence-level knowledge via in-context learning. The most relevant knowledge is captured by a cross-attention mechanism and is further combined with the representations of the entity nodes and the original text to predict the final relation label. On the MNRE dataset, VRMK outperforms recent state-of-the-art baselines, including LLM-based methods, by 2.71% (82.55%→85.26% in F1 score). We also conduct extensive ablation experiments to reveal the contributions of different modules and provide useful insights for future research.
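The abstract states that the most relevant knowledge is captured by a cross-attention mechanism, without giving details. As a rough illustration only, the following is a minimal sketch of generic single-head scaled dot-product cross-attention, in which one entity vector attends over a few knowledge vectors. The function names, the toy two-dimensional embeddings, and the choice to use the same vectors as keys and values are assumptions made for this example and are not taken from the paper; the actual VRMK module may differ.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, keys, values):
    """Generic scaled dot-product attention for a single query vector.

    Returns the attention weights over the keys and the weighted sum
    of the values. This is a textbook formulation, not the paper's code.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context


# Hypothetical toy embeddings: one entity representation (the query) and
# three LLM-generated knowledge snippets (used here as both keys and values).
entity_query = [1.0, 0.0]
knowledge = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

weights, context = cross_attention(entity_query, knowledge, knowledge)
```

In this toy setup the snippet whose vector is most aligned with the entity query receives the largest weight, so the returned context vector is dominated by that snippet. How the paper combines such a vector with the entity-node and text representations is not specified in the abstract.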
Computer Science