Abstract: Multimodal information extraction (IE) has attracted increasing attention because many studies have shown that multimodal information benefits textual information extraction. However, existing multimodal IE datasets mainly focus on sentence-level, image-facilitated IE in English text and pay little attention to video-based multimodal IE and fine-grained visual grounding. To promote the development of multimodal IE, we construct a multimodal, multilingual, multitask dataset named M$^{3}$D, which has the following features: (1) it contains paired document-level text and video to enrich multimodal information; (2) it supports two widely used languages, English and Chinese; (3) it includes more multimodal IE tasks, such as entity recognition, entity chain extraction, relation extraction, and visual grounding. In addition, our dataset introduces a previously unexplored theme, biography, broadening the domains covered by multimodal IE resources. To establish a benchmark for our dataset, we propose a hierarchical multimodal IE model that leverages and integrates multimodal information through a Denoised Feature Fusion Module (DFFM). Furthermore, because modal information is often incomplete in non-ideal scenarios, we design a Missing Modality Construction Module (MMCM) to alleviate the issues caused by missing modalities. Our model achieves average performance of 53.80% and 53.77% across four tasks on the English and Chinese datasets, respectively, setting a reasonable baseline for subsequent research. We also conduct further analytical experiments to verify the effectiveness of the proposed modules. We believe that our work can promote the development of the field of multimodal IE.

M3: A Multi-Image Multi-Modal Entity Alignment Dataset

Rethinking Uncertainly Missing and Ambiguous Visual Modality in Multi-Modal Entity Alignment

Multi‐scale Cross‐domain Alignment for Person Image Generation

MEAformer: Multi-modal Entity Alignment Transformer for Meta Modality Hybrid

Leveraging Intra-modal and Inter-modal Interaction for Multi-Modal Entity Alignment

IBMEA: Exploring Variational Information Bottleneck for Multi-modal Entity Alignment

Attribute-Consistent Knowledge Graph Representation Learning for Multi-Modal Entity Alignment

MCSFF: Multi-modal Consistency and Specificity Fusion Framework for Entity Alignment

LoginMEA: Local-to-Global Interaction Network for Multi-modal Entity Alignment

Pseudo-Label Calibration Semi-supervised Multi-Modal Entity Alignment

M$^{3}$D: A Multimodal, Multilingual and Multitask Dataset for Grounded Document-level Information Extraction

Multimodal Entity Linking: A New Dataset and A Baseline

Multi-Modal Entity Alignment Method Based on Feature Enhancement

Multi-modal Contrastive Representation Learning for Entity Alignment

A Multi-Modal Entity Alignment Method with Inter-Modal Enhancement

MMIEA: Multi-modal Interaction Entity Alignment Model for Knowledge Graphs

$M^3EL$: A Multi-task Multi-topic Dataset for Multi-modal Entity Linking

Multi-information Embedding Based Entity Alignment

M3AE: Multimodal Representation Learning for Brain Tumor Segmentation with Missing Modalities

MAF - A General Matching and Alignment Framework for Multimodal Named Entity Recognition