Training on Synthetic Data Beats Real Data in Multimodal Relation Extraction

Zilin Du, Haoxin Li, Xu Guo, Boyang Li

2023-12-05

Abstract: The task of multimodal relation extraction has attracted significant research attention, but progress is constrained by the scarcity of available training data. One natural thought is to extend existing datasets with cross-modal generative models. In this paper, we consider a novel problem setting, where only unimodal data, either text or image, are available during training. We aim to train a multimodal classifier from synthetic data that performs well on real multimodal test data. However, training with synthetic data suffers from two obstacles: lack of data diversity and label information loss. To alleviate these issues, we propose Mutual Information-aware Multimodal Iterated Relational dAta GEneration (MI2RAGE), which applies Chained Cross-modal Generation (CCG) to promote diversity in the generated data and exploits a teacher network to select valuable training samples with high mutual information with the ground-truth labels. Comparing our method to direct training on synthetic data, we observed a significant improvement of 24.06% F1 with synthetic text and 26.42% F1 with synthetic images. Notably, our best model trained on completely synthetic images outperforms prior state-of-the-art models trained on real multimodal data by a margin of 3.76% in F1. Our codebase will be made available upon acceptance.

Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition, Machine Learning

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses data scarcity in Multimodal Relation Extraction (MRE). Specifically:

1. **Data Scarcity**: Multimodal Relation Extraction can reduce ambiguity and enhance representation learning, but its progress is limited by the lack of available training data. For example, the popular MNRE-2 dataset contains only 15,485 samples, whereas text relation extraction datasets (such as WebNLG, WikiReading, and FewRel) have more relation categories and millions of data instances.
2. **Multimodal Relation Extraction with a Missing Modality (MREMM)**: The paper proposes a new problem setting in which only one modality (either text or images) is available during training. Cross-modal generation models synthesize the missing modality, and a multimodal classifier trained on this synthetic data is expected to perform well on real multimodal test data.

Synthetic data brings two main obstacles: lack of data diversity and loss of label information. To overcome them, the authors propose MI2RAGE (Mutual Information-aware Multimodal Iterated Relational dAta GEneration). The method increases data diversity through Chained Cross-modal Generation (CCG) and uses a teacher network to select training samples that have high mutual information with the ground-truth labels.

Compared with direct training on synthetic data, the method improves F1 by 24.06% with synthetic text and by 26.42% with synthetic images. The best model, trained on fully synthetic images, outperforms the previous state-of-the-art models trained on real multimodal data by 3.76% in F1.
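The two ingredients named above, chained cross-modal generation and teacher-based sample selection, might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the summary does not specify how the chain is built or how mutual information is estimated. Here the chain simply alternates hypothetical `text_to_image` and `image_to_text` callables, and the selection score is a pointwise mutual information proxy, log p_teacher(y | x) - log p(y), with p(y) taken as the empirical label frequency. All function names and parameters are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple


def chained_generation(
    text: str,
    text_to_image: Callable[[str], str],
    image_to_text: Callable[[str], str],
    rounds: int,
) -> List[Tuple[str, str]]:
    """Alternate text -> image -> text ... and collect (text, image) pairs.

    Assumption: each round feeds the caption of the previous image back
    into the image generator, so later pairs drift away from the seed text.
    """
    pairs = []
    current_text = text
    for _ in range(rounds):
        image = text_to_image(current_text)
        pairs.append((current_text, image))
        current_text = image_to_text(image)
    return pairs


def pointwise_mi(
    teacher_probs: Dict[str, float],
    label: str,
    label_prior: Dict[str, float],
    eps: float = 1e-12,
) -> float:
    """Proxy score: log p_teacher(label | sample) - log p(label)."""
    p_cond = max(teacher_probs.get(label, 0.0), eps)
    return math.log(p_cond) - math.log(label_prior[label])


def select_samples(
    samples: Sequence[str],
    labels: Sequence[str],
    teacher: Callable[[str], Dict[str, float]],
    keep: int,
) -> List[Tuple[str, str]]:
    """Keep the `keep` samples whose teacher score is highest."""
    counts = Counter(labels)
    prior = {y: c / len(labels) for y, c in counts.items()}
    scored = [
        (pointwise_mi(teacher(x), y, prior), x, y)
        for x, y in zip(samples, labels)
    ]
    # Sort on the score only, so samples never need to be comparable.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(x, y) for _, x, y in scored[:keep]]
```

In practice the two callables would wrap a text-to-image model and an image captioner, and `teacher` would wrap a trained classifier returning a label distribution. A sample is favored when the teacher assigns its label a higher probability than the label's base rate, which is one simple way to approximate "high mutual information with the label".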

X-Gacmn: An X-Shaped Generative Adversarial Cross-Modal Network With Hypersphere Embedding

Multimodal Relation Extraction with Cross-Modal Retrieval and Synthesis

On Analyzing the Role of Image for Visual-Enhanced Relation Extraction (Student Abstract).

Different Data, Different Modalities! Reinforced Data Splitting for Effective Multimodal Information Extraction from Social Media Posts.

S2ynRE: Two-stage Self-training with Synthetic Data for Low-resource Relation Extraction.

Multimodal Synthetic Dataset Balancing: a Framework for Realistic and Balanced Training Data Generation in Industrial Settings

Training Multimedia Event Extraction With Generated Images and Captions

Dual-Gated Fusion with Prefix-Tuning for Multi-Modal Relation Extraction

Multimodal Relation Extraction via a Mixture of Hierarchical Visual Context Learners

Learning from Different Text-Image Pairs: A Relation-enhanced Graph Convolutional Network for Multimodal NER

CGI-MRE: A Comprehensive Genetic-Inspired Model For Multimodal Relation Extraction

Is synthetic data from generative models ready for image recognition?

I2SRM: Intra- and Inter-Sample Relationship Modeling for Multimodal Information Extraction

Multimodal Misinformation Detection by Learning from Synthetic Data with Multimodal LLMs

Caption-Aware Multimodal Relation Extraction with Mutual Information Maximization

Watch and Read! A Visual Relation-Aware and Textual Evidence Enhanced Model for Multimodal Relation Extraction

Exploiting Visual Relation and Multi-Grained Knowledge for Multimodal Relation Extraction

SK-VQA: Synthetic Knowledge Generation at Scale for Training Context-Augmented Multimodal LLMs

Towards a Theoretical Understanding of Synthetic Data in LLM Post-Training: A Reverse-Bottleneck Perspective

Information Screening whilst Exploiting! Multimodal Relation Extraction with Feature Denoising and Multimodal Topic Modeling