MORE: A Multimodal Object-Entity Relation Extraction Dataset with a Benchmark Evaluation

Liang He,Hongke Wang,Yongchang Cao,Zhen Wu,Jianbing Zhang,Xinyu Dai

DOI: https://doi.org/10.1145/3581783.3612209

2023-12-15

Abstract: Extracting relational facts from multimodal data is a crucial task in the field of multimedia and knowledge graphs that feeds into widespread real-world applications. The emphasis of recent studies centers on recognizing relational facts in which both entities are present in one modality and supplementary information is used from other modalities. However, such works disregard a substantial amount of multimodal relational facts that arise across different modalities, such as one entity seen in a text and another in an image. In this paper, we propose a new task, namely Multimodal Object-Entity Relation Extraction, which aims to extract "object-entity" relational facts from image and text data. To facilitate research on this task, we introduce MORE, a new dataset comprising 21 relation types and 20,264 multimodal relational facts annotated on 3,559 pairs of textual news titles and corresponding images. To show the challenges of Multimodal Object-Entity Relation Extraction, we evaluated recent state-of-the-art methods for multimodal relation extraction and conducted a comprehensive experimentation analysis on MORE. Our results demonstrate significant challenges for existing methods, underlining the need for further research on this task. Based on our experiments, we identify several promising directions for future research. The MORE dataset and code are available at https://github.com/NJUNLP/MORE.

Multimedia

What problem does this paper attempt to address?

The paper addresses the problem of extracting "object-entity" relational facts that span modalities, where one argument is an entity mentioned in text and the other is an object shown in an image. Its contributions are:

1. **Proposing a New Task**: The paper introduces Multimodal Object-Entity Relation Extraction, which aims to extract "object-entity" relational facts from paired text and image data. This differs from earlier multimodal relation extraction tasks, which focus on relations between entities within a single modality and use the other modality only as auxiliary information.
2. **Constructing a New Dataset**: To support research on this task, the authors built MORE, a dataset with 21 relation types and 20,264 multimodal relational facts annotated on 3,559 pairs of news titles and corresponding images. What sets the dataset apart is that each fact links an entity in the text to an object in the image, a setting that previous multimodal relation extraction datasets did not cover.
3. **Evaluating Existing Methods**: To show how difficult the task is, the authors benchmarked recent state-of-the-art multimodal relation extraction methods on MORE and carried out a comprehensive experimental analysis. The results show that existing methods face significant challenges on this task, which remains open.

Based on these experiments, the paper highlights the shortcomings of existing methods and identifies several promising directions for future research.
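To make the task setting concrete, the following is a minimal Python sketch of how one text-image pair and its cross-modal facts could be represented. It is an illustration only: the class names, field names, the `located_in` relation label, the image path, and the example title are all hypothetical and are not taken from the MORE release, whose actual schema may differ. The only figures drawn from the paper are the dataset counts (20,264 facts over 3,559 pairs), used here to compute the average number of facts per pair.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TextEntity:
    """An entity mentioned in the news title (hypothetical field names)."""
    mention: str
    span: Tuple[int, int]  # character offsets [start, end) into the title


@dataclass(frozen=True)
class VisualObject:
    """An object in the image, located by a bounding box (hypothetical)."""
    box: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass(frozen=True)
class ObjectEntityFact:
    """One cross-modal fact: an image object related to a text entity."""
    obj: VisualObject
    relation: str
    entity: TextEntity


@dataclass
class Sample:
    """One news title, its image, and the facts annotated on the pair."""
    title: str
    image_path: str
    facts: List[ObjectEntityFact]


def mean_facts_per_pair(num_facts: int, num_pairs: int) -> float:
    """Average number of annotated facts per text-image pair."""
    return num_facts / num_pairs


# Made-up example; the relation label is NOT one of MORE's 21 types.
title = "A mayor opens the new bridge in Springfield"
start = title.index("Springfield")
entity = TextEntity("Springfield", (start, start + len("Springfield")))
bridge = VisualObject(box=(40, 60, 520, 300))
sample = Sample(
    title=title,
    image_path="images/example.jpg",  # hypothetical path
    facts=[ObjectEntityFact(bridge, "located_in", entity)],
)

# Dataset-level statistic from the counts reported in the abstract.
average = mean_facts_per_pair(20264, 3559)
print(f"{average:.2f} facts per text-image pair")
```

The point of the sketch is that the two arguments of a fact live in different modalities: the entity is addressed by a text span and the object by an image region, so a model has to ground both before it can classify the relation between them.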

Multimodal Relational Triple Extraction with Query-based Entity Object Transformer

Entity-Relation Extraction As Multi-Turn Question Answering

Multimodal Relation Extraction with Cross-Modal Retrieval and Synthesis

On Analyzing the Role of Image for Visual-Enhanced Relation Extraction (Student Abstract)

A Hierarchical Network for Multimodal Document-Level Relation Extraction

Named Entity and Relation Extraction with Multi-Modal Retrieval

Exploiting Visual Relation and Multi-Grained Knowledge for Multimodal Relation Extraction

MORE: A Metric Learning Based Framework for Open-domain Relation Extraction

M$^{3}$D: A Multimodal, Multilingual and Multitask Dataset for Grounded Document-level Information Extraction

MMRel: A Relation Understanding Benchmark in the MLLM Era

Multimodal Entity Linking: A New Dataset and A Baseline

Caption-Aware Multimodal Relation Extraction with Mutual Information Maximization

Dual-Gated Fusion with Prefix-Tuning for Multi-Modal Relation Extraction

ReCon1M: A Large-scale Benchmark Dataset for Relation Comprehension in Remote Sensing Imagery

Towards Bridged Vision and Language: Learning Cross-modal Knowledge Representation for Relation Extraction

Multimodal Named Entity Recognition and Relation Extraction with Retrieval-Augmented Strategy

On Analyzing the Role of Image for Visual-enhanced Relation Extraction

Multimodal Relation Extraction via a Mixture of Hierarchical Visual Context Learners

Watch and Read! A Visual Relation-Aware and Textual Evidence Enhanced Model for Multimodal Relation Extraction

MAVEN-ERE: A Unified Large-scale Dataset for Event Coreference, Temporal, Causal, and Subevent Relation Extraction