Multimodal Relational Triple Extraction with Query-based Entity Object Transformer

Lei Hei, Ning An, Tingjing Liao, Qi Ma, Jiaqi Wang, Feiliang Ren

2024-08-16

Abstract: Multimodal relation extraction is crucial for constructing flexible and realistic knowledge graphs. Recent studies focus on extracting the relation type for entity pairs that span different modalities, such as one entity in the text and another in the image. However, existing approaches require the entities and objects to be given beforehand, which is costly and impractical. To address this limitation, we propose a novel task, Multimodal Entity-Object Relational Triple Extraction, which aims to extract all triples (entity span, relation, object region) from image-text pairs. To facilitate this study, we modified MORE, a multimodal relation extraction dataset with 21 relation types, to create a new dataset containing 20,264 triples, averaging 5.75 triples per image-text pair. Moreover, we propose QEOT, a query-based model with a selective attention mechanism, to dynamically explore the interaction and fusion of textual and visual information. In particular, the proposed method can simultaneously accomplish entity extraction, relation classification, and object detection with a set of queries. Our method is suitable for downstream applications and reduces the error accumulation caused by pipeline-style approaches. Extensive experimental results demonstrate that our proposed method outperforms the existing baselines by 8.06% and achieves state-of-the-art performance.

Information Retrieval

What problem does this paper attempt to address?

The paper attempts to address the following issues:

1. **Multimodal Entity-Object Relational Triple Extraction**: Existing methods for multimodal relation extraction require the entity pairs to be specified in advance, which is both expensive and impractical in real-world applications. The paper therefore proposes a new task: extracting all triples of the form (entity span, relation type, object region) from image-text pairs. This task requires identifying entities in the text and objects in the image, and also predicting the relations between them.

2. **End-to-End Approach**: Existing methods typically adopt a pipeline approach, performing entity extraction, relation classification, and object detection separately, which leads to error accumulation. The paper proposes an end-to-end approach that accomplishes these three tasks simultaneously, reducing error accumulation.

3. **New Dataset**: To facilitate research on this new task, the paper modifies the existing multimodal relation extraction dataset MORE to create a new dataset, MORTE, which contains 20,264 triples, with an average of 5.75 triples per image-text pair.

To tackle this task, the paper proposes a new model, QEOT (Query-based Entity-Object Transformer), which employs a selective attention mechanism and a gated fusion mechanism to dynamically explore the interaction and fusion of textual and visual information. Experimental results show that this method significantly outperforms existing baseline methods on multiple metrics.
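To make the target output of the task concrete, the following is a minimal Python sketch of how an (entity span, relation, object region) triple might be represented, along with the per-pair average used as a dataset statistic. The class name `Triple`, its field names, the token-offset and bounding-box conventions, and the relation labels are all assumptions made for illustration; they are not taken from the paper or from the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Triple:
    """One (entity span, relation, object region) triple.

    Hypothetical layout for illustration only, not the dataset's real schema.
    """
    entity_span: Tuple[int, int]  # (start, end) token offsets in the text, end exclusive
    relation: str                 # a relation type label
    object_box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def average_triples_per_pair(samples: List[List[Triple]]) -> float:
    """Return the mean number of triples per image-text pair (0.0 if empty)."""
    if not samples:
        return 0.0
    return sum(len(triples) for triples in samples) / len(samples)


# Toy, made-up annotations for two image-text pairs (not real dataset entries).
pair_a = [
    Triple((0, 2), "held_by", (10.0, 20.0, 110.0, 220.0)),
    Triple((5, 6), "located_in", (0.0, 0.0, 640.0, 480.0)),
]
pair_b = [Triple((3, 4), "part_of", (50.0, 60.0, 150.0, 160.0))]

print(average_triples_per_pair([pair_a, pair_b]))
```

Under this representation, an end-to-end model would emit a set of such triples per image-text pair directly, rather than passing the output of an entity extractor and an object detector into a separate relation classifier.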

Entity-Relation Extraction As Multi-Turn Question Answering

MORE: A Multimodal Object-Entity Relation Extraction Dataset with a Benchmark Evaluation

Joint Extraction of Triple Knowledge Based on Relation Priority

Bridging Text and Knowledge with Multi-Prototype Embedding for Few-Shot Relational Triple Extraction

On Analyzing the Role of Image for Visual-Enhanced Relation Extraction (Student Abstract)

Mutually Guided Few-shot Learning for Relational Triple Extraction

Query-based Instance Discrimination Network for Relational Triple Extraction

Dual-Gated Fusion with Prefix-Tuning for Multi-Modal Relation Extraction

Exploiting Visual Relation and Multi-Grained Knowledge for Multimodal Relation Extraction

Watch and Read! A Visual Relation-Aware and Textual Evidence Enhanced Model for Multimodal Relation Extraction

Relational Triple Extraction: One Step is Enough

Multimodal Relation Extraction with Cross-Modal Retrieval and Synthesis

Named Entity and Relation Extraction with Multi-Modal Retrieval

An Entity-Relation Joint Extraction Method Based on Two Independent Sub-Modules From Unstructured Text

A Bi-consolidating Model for Joint Relational Triple Extraction

Joint Multimodal Entity-Relation Extraction Based on Edge-enhanced Graph Alignment Network and Word-pair Relation Tagging

On Analyzing the Role of Image for Visual-enhanced Relation Extraction