On Analyzing the Role of Image for Visual-enhanced Relation Extraction

Lei Li, Xiang Chen, Shuofei Qiao, Feiyu Xiong, Huajun Chen, Ningyu Zhang
DOI: https://doi.org/10.48550/arXiv.2211.07504
2022-11-15
Abstract: Multimodal relation extraction is an essential task for knowledge graph construction. In this paper, we conduct an in-depth empirical analysis indicating that inaccurate information in the visual scene graph leads to poor modal alignment weights, which further degrades performance. Moreover, the visual shuffle experiments illustrate that current approaches may not take full advantage of visual information. Based on these observations, we further propose a strong baseline with an implicit fine-grained multimodal alignment based on Transformer for multimodal relation extraction. Experimental results demonstrate the better performance of our method. Code is available at https://github.com/zjunlp/DeepKE/tree/main/example/re/multimodal.
Computation and Language, Artificial Intelligence, Computer Vision and Pattern Recognition, Information Retrieval, Machine Learning
What problem does this paper attempt to address?
The paper explores the role of images in Multimodal Relation Extraction (MRE) and attempts to address the issues present in existing methods. Specifically, it focuses on the following aspects:

1. **Problems with existing methods**: Current mainstream relation extraction methods are primarily based on textual information, and their performance declines significantly on social media texts, which often lack contextual information. However, visual content (such as images) on social media usually appears alongside text and can be used to supplement missing semantic information, thereby improving performance.
2. **Limitations of visual scene graphs**: Through empirical analysis, the paper finds that current methods do not fully utilize visual information. Inaccurate or misleading information in visual scene graphs can lead to poor modality alignment weights, which in turn degrades model performance. Additionally, experimental results show that in some cases text-only models even outperform visually enhanced models.
3. **Proposed new method**: Based on the above observations, the authors propose a new baseline model called IFAformer. This model employs a Transformer-based dual-stream architecture to encode multimodal inputs and captures the correlation between visual objects and textual entities through an implicit fine-grained multimodal alignment mechanism. Experimental results indicate that IFAformer performs well on standard datasets and that its performance drops significantly when image-text pairs are shuffled, which indicates that the model does use visual information for relation extraction.

In summary, the paper addresses the insufficient use of visual information in existing multimodal relation extraction methods and proposes a new method, IFAformer, that aims to achieve better performance in practical applications.
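
The shuffle test mentioned above can be illustrated with a small sketch. It is not the authors' evaluation code, and the exact shuffling procedure used in the paper is not described here. The sketch assumes each sample is a `(text, image_id)` tuple with distinct image ids; the names `shuffle_images` and `performance_drop` are hypothetical. The idea is to reassign images so that no text keeps its own image, re-evaluate the model, and compare scores: a model that truly relies on the image should score noticeably lower on the shuffled pairs.

```python
import random
from typing import List, Sequence, Tuple


def shuffle_images(pairs: Sequence[Tuple[str, str]], seed: int = 0) -> List[Tuple[str, str]]:
    """Return (text, image_id) pairs where every text receives another sample's image.

    Hypothetical helper: assumes image ids are distinct across samples.
    """
    if len(pairs) < 2:
        raise ValueError("need at least two pairs to shuffle")
    rng = random.Random(seed)
    indices = list(range(len(pairs)))
    # Sattolo's algorithm: yields a single-cycle permutation,
    # so no index is ever mapped to itself.
    for i in range(len(indices) - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i
        indices[i], indices[j] = indices[j], indices[i]
    # Keep the texts in place and attach the permuted images.
    return [(pairs[k][0], pairs[indices[k]][1]) for k in range(len(pairs))]


def performance_drop(score_original: float, score_shuffled: float) -> float:
    """Metric difference (e.g. F1) between matched and shuffled image-text pairs."""
    return score_original - score_shuffled
```

A drop close to zero would suggest the model largely ignores the image, which is the behaviour the paper reports for some existing approaches; a large drop is what the summary above attributes to IFAformer.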