Abstract: Given a text and its related image, the multimodal relation extraction (MRE) task aims to predict the correct semantic relation between two entities in the input text. Although recent MRE approaches have made advances, they still suffer from two drawbacks. First, they ignore fine-grained visual relations between image objects, which can serve as important clues for inferring the correct relation. Second, they cannot utilize useful external knowledge about the entities and the input sentence, leading to suboptimal performance on examples that demand related background or commonsense knowledge. To alleviate the above limitations, we propose a novel method, named VRMK, which exploits both Visual Relations and Multi-grained Knowledge for the MRE task. Specifically, the input image-text pair is converted into a unified multimodal graph. Then, a relation-aware transformer is adopted to update node representations while explicitly encoding diverse relations among visual and textual nodes. Based on the input text, a large language model (LLM) is used to generate entity-level and sentence-level knowledge via in-context learning. The most relevant knowledge is captured by a cross-attention mechanism and is further combined with the representations of the entity nodes and the original text to predict the final relation label. On the MNRE dataset, VRMK outperforms recent state-of-the-art baselines, including LLM-based methods, by 2.71 percentage points in F1 score (82.55% → 85.26%). We also conduct extensive ablation experiments to reveal the contributions of the different modules and provide useful insights for future research.
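
As an illustration of the knowledge-selection step described in the abstract, the following is a minimal sketch of single-head scaled dot-product cross-attention, in which an entity representation queries a small set of knowledge embeddings and the attended result is concatenated with the entity vector. It is not the VRMK implementation: the function names, the toy 4-dimensional vectors, the absence of learned projection matrices, and the use of concatenation for fusion are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, keys, values):
    """Single-head scaled dot-product attention for one query vector.

    Hypothetical helper: no learned projections, just raw dot products.
    Returns the attended context vector and the attention weights.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return context, weights


# Toy entity representation (assumed values, not taken from the paper).
entity = [1.0, 0.0, 1.0, 0.0]

# Toy embeddings of three generated knowledge snippets (assumed values).
knowledge = [
    [1.0, 0.0, 1.0, 0.0],  # closely related to the entity
    [0.0, 1.0, 0.0, 1.0],  # unrelated
    [0.5, 0.5, 0.5, 0.5],  # partially related
]

context, weights = cross_attention(entity, knowledge, knowledge)

# Assumed fusion step: concatenate the entity vector with the attended knowledge.
fused = entity + context
```

In this toy setting the snippet most similar to the entity receives the largest attention weight, so the fused vector is dominated by the most relevant knowledge; a real model would apply learned query, key, and value projections before this step.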

Relation Extraction with Knowledge-Enhanced Prompt-Tuning on Multimodal Knowledge Graph

Knowledge Representation Learning with Entity Descriptions, Hierarchical Types, and Textual Relations

Exploiting Visual Relation and Multi-Grained Knowledge for Multimodal Relation Extraction

Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge Graph Completion

On Analyzing the Role of Image for Visual-Enhanced Relation Extraction (Student Abstract)

Relation Extraction as Open-book Examination: Retrieval-enhanced Prompt Tuning

Structure Pre-training and Prompt Tuning for Knowledge Graph Transfer

Towards Bridged Vision and Language: Learning Cross-modal Knowledge Representation for Relation Extraction

Few-Shot Joint Multimodal Entity-Relation Extraction via Knowledge-Enhanced Cross-modal Prompt Model

Watch and Read! A Visual Relation-Aware and Textual Evidence Enhanced Model for Multimodal Relation Extraction

Dual-Gated Fusion with Prefix-Tuning for Multi-Modal Relation Extraction

Multimodal Relation Extraction with Cross-Modal Retrieval and Synthesis

KICE: A Knowledge Consolidation and Expansion Framework for Relation Extraction

KnowPrompt: Knowledge-aware Prompt-tuning with Synergistic Optimization for Relation Extraction

APRE: Annotation-Aware Prompt-Tuning for Relation Extraction

AdaPrompt: Adaptive Prompt-based Finetuning for Relation Extraction

Multi-modal Recommendation Based on Knowledge Graph

Retrieval, Reasoning, Re-ranking: A Context-Enriched Framework for Knowledge Graph Completion

On Analyzing the Role of Image for Visual-enhanced Relation Extraction

Knowledge-Aware And Retrieval-Based Models For Distantly Supervised Relation Extraction

Named Entity and Relation Extraction with Multi-Modal Retrieval