Abstract: Multimodal relation extraction is an essential task for knowledge graph construction. In this paper, we conduct an in-depth empirical analysis, which indicates that inaccurate information in the visual scene graph leads to poor modal alignment weights, further degrading performance. Moreover, the visual shuffle experiments illustrate that current approaches may not take full advantage of visual information. Based on these observations, we further propose a strong Transformer-based baseline with implicit fine-grained multimodal alignment for multimodal relation extraction. Experimental results demonstrate the better performance of our method. Code is available at https://github.com/zjunlp/DeepKE/tree/main/example/re/multimodal.

What problem does this paper attempt to address?

The paper aims to explore the role of images in the task of Multimodal Relation Extraction (MRE) and attempts to address the issues present in existing methods. Specifically, the paper focuses on the following aspects:

1. **Problems with existing methods**: Current mainstream relation extraction methods are primarily based on textual information, and their performance significantly declines when dealing with social media texts, as these texts often lack contextual information. However, visual content (such as images) on social media usually appears alongside text and can be used to supplement missing semantic information, thereby improving performance.
2. **Limitations of visual scene graphs**: Through empirical analysis, the paper finds that current methods do not fully utilize visual information. Inaccurate or misleading information in visual scene graphs can lead to poor modality alignment weights, which in turn affects model performance. Additionally, experimental results show that in some cases, pure text models even outperform visually enhanced models.
3. **Proposed new method**: Based on the above observations, the authors propose a new baseline model called IFAformer. This model employs a Transformer-based dual-stream architecture to encode multimodal inputs and captures the correlation between visual objects and textual entities through an implicit fine-grained multimodal alignment mechanism. Experimental results indicate that IFAformer not only performs well on standard datasets but also shows a significant performance drop when image and text pairs are shuffled, demonstrating that the model indeed utilizes visual information for relation extraction.

In summary, the paper attempts to address the issue of insufficient utilization of visual information in existing multimodal relation extraction methods and proposes a new method, IFAformer, aiming to achieve better performance in practical applications.
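The two ideas in the summary above, attention-based alignment between text tokens and visual objects, and the shuffle check on image-text pairs, can be illustrated with a small NumPy sketch. This is not the authors' IFAformer code: the function names `cross_modal_attention` and `shuffle_pairs`, the feature shapes, and the use of plain scaled dot-product attention are all assumptions chosen for illustration. The shuffle helper uses a permutation with no fixed points, so that every text is paired with an image from a different sample; the paper may shuffle differently.

```python
import numpy as np


def cross_modal_attention(text_feats, obj_feats):
    """Attend from text tokens (queries) to visual objects (keys and values).

    text_feats: array of shape (n_tokens, d)
    obj_feats:  array of shape (n_objects, d)
    Returns (weights, fused): weights has shape (n_tokens, n_objects) and each
    row sums to 1; fused has shape (n_tokens, d).
    Hypothetical helper, not the IFAformer implementation.
    """
    d = text_feats.shape[-1]
    scores = text_feats @ obj_feats.T / np.sqrt(d)
    # Subtract the row-wise max before exponentiating for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights, weights @ obj_feats


def shuffle_pairs(n, rng):
    """Return image indices for n texts such that no text keeps its own image.

    Uses rejection sampling over random permutations; assumes n >= 2.
    """
    if n < 2:
        raise ValueError("need at least two samples to shuffle pairs")
    identity = np.arange(n)
    while True:
        perm = rng.permutation(n)
        if not np.any(perm == identity):
            return perm


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text = rng.normal(size=(5, 8))   # 5 tokens, 8-dim features (made-up sizes)
    objs = rng.normal(size=(3, 8))   # 3 detected objects
    weights, fused = cross_modal_attention(text, objs)
    print(weights.shape, fused.shape)
    print(shuffle_pairs(4, rng))
```

In a shuffle experiment of this kind, a model is evaluated once with the original pairs and once with images re-indexed by `shuffle_pairs`; a model that truly relies on the image should score noticeably lower in the second run.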

On Analyzing the Role of Image for Visual-enhanced Relation Extraction

On Analyzing the Role of Image for Visual-Enhanced Relation Extraction (Student Abstract)
