Abstract: Given a text and its related image, the multimodal relation extraction (MRE) task aims to predict the correct semantic relation between two entities in the input text. Although recent MRE approaches have made progress, they still suffer from two drawbacks. First, they ignore fine-grained visual relations between image objects, which can serve as important clues for inferring the correct relation. Second, they cannot exploit useful external knowledge about the entities and the input sentence, which leads to sub-optimal performance on examples that demand background or commonsense knowledge. To alleviate these limitations, we propose a novel method, named VRMK, which exploits both Visual Relations and Multi-grained Knowledge for the MRE task. Specifically, the input image-text pair is converted into a unified multimodal graph. A relation-aware transformer then updates the node representations while explicitly encoding the diverse relations among visual and textual nodes. Based on the input text, a large language model (LLM) is used to generate entity-level and sentence-level knowledge via in-context learning. The most relevant knowledge is captured by a cross-attention mechanism and is then combined with the representations of the entity nodes and the original text to predict the final relation label. On the MNRE dataset, VRMK outperforms recent state-of-the-art baselines, including LLM-based methods, by 2.71% (82.55%→85.26% in F1 score). We also conduct extensive ablation experiments to reveal the contributions of the different modules and to provide useful insights for future research.
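
The knowledge-selection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify what serves as the attention query, how the representations are combined, or any dimensions. The sketch assumes the text representation is the query, the LLM-generated knowledge vectors act as both keys and values, and the final combination is a plain concatenation. The learned projection matrices of a real attention layer are omitted, and all function names (`cross_attention`, `fuse`) are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, keys, values):
    """Scaled dot-product attention for a single query vector.

    Returns the attention-weighted sum of `values` and the weights.
    Learned query/key/value projections are omitted in this sketch.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return context, weights


def fuse(head_entity, tail_entity, text_repr, knowledge):
    """Assumed fusion: attend over knowledge vectors with the text
    representation as query, then concatenate everything."""
    context, weights = cross_attention(text_repr, knowledge, knowledge)
    fused = head_entity + tail_entity + text_repr + context  # list concatenation
    return fused, weights


# Toy 2-dimensional vectors, chosen only for illustration.
text_repr = [1.0, 0.0]
knowledge = [[2.0, 0.0], [0.0, 2.0], [-1.0, 0.0]]
fused, weights = fuse([0.5, 0.5], [0.1, 0.9], text_repr, knowledge)
print(len(fused), weights)
```

In this toy example the first knowledge vector points in the same direction as the text representation, so it receives the largest attention weight. In a full model the fused vector would feed a classifier over relation labels.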

On Analyzing the Role of Image for Visual-Enhanced Relation Extraction (Student Abstract)

On Analyzing the Role of Image for Visual-enhanced Relation Extraction

Exploiting Visual Relation and Multi-Grained Knowledge for Multimodal Relation Extraction

Visual relationship detection with a deep convolutional relationship network

Towards Bridged Vision and Language: Learning Cross-modal Knowledge Representation for Relation Extraction

Watch and Read! A Visual Relation-Aware and Textual Evidence Enhanced Model for Multimodal Relation Extraction

Multimodal Relation Extraction via a Mixture of Hierarchical Visual Context Learners

Visual Description Augmented Integration Network for Multimodal Entity and Relation Extraction

Dual-Gated Fusion with Prefix-Tuning for Multi-Modal Relation Extraction

Knowledge-Enhanced Scene Graph Generation with Multimodal Relation Alignment (Student Abstract)

Multimodal Relational Triple Extraction with Query-based Entity Object Transformer

Caption-Aware Multimodal Relation Extraction with Mutual Information Maximization

Information Screening whilst Exploiting! Multimodal Relation Extraction with Feature Denoising and Multimodal Topic Modeling

Multimodal Relation Extraction with Cross-Modal Retrieval and Synthesis

Relation Extraction with Knowledge-Enhanced Prompt-Tuning on Multimodal Knowledge Graph

TSVFN: Two-Stage Visual Fusion Network for multimodal relation extraction

Focus & Gating: A Multimodal Approach for Unveiling Relations in Noisy Social Media

Vision, Deduction and Alignment: An Empirical Study on Multi-modal Knowledge Graph Alignment

Good Visual Guidance Makes A Better Extractor: Hierarchical Visual Prefix for Multimodal Entity and Relation Extraction

A Hierarchical Network for Multimodal Document-Level Relation Extraction

Probing the Impacts of Visual Context in Multimodal Entity Alignment