Watch and Read! A Visual Relation-Aware and Textual Evidence Enhanced Model for Multimodal Relation Extraction

Weiping Wang, Huan Liu, Qianru Shen, Hailun Lin, Zheng Lin
DOI: https://doi.org/10.1109/CSCWD61410.2024.10580454
2024-05-08
Abstract: Multimodal relation extraction (MRE) aims to predict the semantic relation between two entities given a hybrid context of a text and its related image. Although existing MRE methods have explored different strategies for fusing multimodal information, they suffer from two limitations. First, they ignore fine-grained visual relations between objects, which can provide important hints for inferring the correct relation. Second, they neglect informative textual evidence in the image, leading to a performance decline on text-intensive images. To address these issues, we propose a novel MRE model, named VRTE, which takes full advantage of both Visual Relations and Textual Evidence to determine the final relation label. Specifically, the input image-text pair is transformed into two scene graphs, which are then bridged into a unified multimodal graph. Next, a relation-aware Transformer propagates information through the multimodal graph while explicitly encoding the diverse relations among visual objects and textual tokens via learnable relation embeddings. In addition, a cross-attention mechanism captures valuable textual information from the OCR results and image captions, which is combined with the representations of the entity nodes and the original text to predict the final relation label. Experimental results on the MNRE dataset demonstrate the effectiveness of the proposed model. Extensive ablation studies are also conducted to analyse the contributions of the different modules.
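The abstract does not give the exact formulation of the relation-aware Transformer, so the following is only a minimal sketch of one common way to inject learnable relation embeddings into attention over graph nodes: the embedding of the relation between nodes i and j is added to the key of node j before the dot product with the query of node i. The function name `relation_aware_attention`, the use of identity query/key/value projections, and the toy relation types are assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def relation_aware_attention(nodes, rel_ids, rel_emb):
    """Single-head attention over graph nodes with relation-biased keys.

    nodes:   list of n node vectors, each of dimension d (visual objects
             and textual tokens would share this space in a multimodal graph)
    rel_ids: n x n matrix; rel_ids[i][j] is the index of the relation type
             on the edge from node i to node j (hypothetical encoding)
    rel_emb: list of relation embeddings, each of dimension d; in a trained
             model these would be learnable parameters

    Assumption: queries, keys, and values are the node vectors themselves
    (no learned projections), to keep the sketch short.
    """
    d = len(nodes[0])
    outputs = []
    for i, query in enumerate(nodes):
        scores = []
        for j, key in enumerate(nodes):
            rel = rel_emb[rel_ids[i][j]]
            # Add the relation embedding to the key so the score depends
            # on the edge type between i and j, not only on node content.
            biased_key = [k + r for k, r in zip(key, rel)]
            scores.append(dot(query, biased_key) / math.sqrt(d))
        weights = softmax(scores)
        # Weighted sum of the (unmodified) node vectors as values.
        outputs.append(
            [sum(w * nodes[j][c] for j, w in enumerate(weights)) for c in range(d)]
        )
    return outputs


if __name__ == "__main__":
    # Three toy nodes; relation type 0 means "no relation", type 1 is a
    # hypothetical visual relation linking nodes 0 and 1.
    nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    rel_ids = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
    rel_emb = [[0.0, 0.0], [2.0, 2.0]]
    for row in relation_aware_attention(nodes, rel_ids, rel_emb):
        print(row)
```

With all relation embeddings set to zero, the function reduces to plain dot-product attention, which makes the effect of the relation term easy to isolate.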
Computer Science