Caption-Aware Multimodal Relation Extraction with Mutual Information Maximization

Yanhui Li, Zefan Zhang, Weiqi Zhang, Tian Bai
DOI: https://doi.org/10.1145/3664647.3681219
2024-10-28
Abstract: Multimodal Relation Extraction (MRE) has achieved great improvements. However, modern MRE models are easily affected by irrelevant objects during multimodal alignment, a problem known as error sensitivity. The main reason is that visual features are not fully aligned with textual features, and the reasoning process may suppress redundant and noisy information at the risk of losing critical information. In light of this, we propose a Caption-Aware Multimodal Relation Extraction Network with Mutual Information Maximization (CAMIM). Specifically, we first generate detailed image captions through a Large Language Model (LLM). Then, the Caption-Aware Module (CAM) hierarchically aligns the fine-grained visual entities and textual entities for reasoning. In addition, to preserve crucial information within different modalities, we leverage a Mutual Information Maximization method to regulate the multimodal reasoning module. Experiments show that our model outperforms the state-of-the-art MRE models on the benchmark dataset MNRE. Further ablation studies demonstrate that our Caption-Aware Module and Mutual Information Maximization method are pluggable and effective. Our code is available
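The abstract does not say which mutual information estimator CAMIM uses. As a rough illustration of what "mutual information maximization" between modalities can look like, the sketch below uses the InfoNCE contrastive bound, which is a common choice and is an assumption here, not the authors' confirmed method. The function names (`info_nce_loss`, `mi_lower_bound`), the temperature value, and the toy feature vectors are all hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce_loss(text_feats, image_feats, temperature=0.1):
    """InfoNCE loss for a batch of paired text/image feature vectors.

    Pair i (text_feats[i], image_feats[i]) is the positive; every other
    image in the batch serves as a negative for text i. This is an assumed
    stand-in for the paper's objective, not taken from the paper itself.
    """
    text = [normalize(t) for t in text_feats]
    image = [normalize(v) for v in image_feats]
    n = len(text)
    total = 0.0
    for i in range(n):
        # Cosine similarity of text i with every image, scaled by temperature.
        logits = [dot(text[i], image[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # Cross-entropy with the matching image as the target class.
        total += log_denominator - logits[i]
    return total / n


def mi_lower_bound(text_feats, image_feats, temperature=0.1):
    """InfoNCE lower bound on mutual information: log(N) minus the loss."""
    loss = info_nce_loss(text_feats, image_feats, temperature)
    return math.log(len(text_feats)) - loss


if __name__ == "__main__":
    # Toy 2-D features: aligned pairs point the same way, shuffled pairs do not.
    text = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    shuffled = [[0.0, 1.0], [1.0, 0.0]]
    print("aligned loss: ", info_nce_loss(text, aligned))
    print("shuffled loss:", info_nce_loss(text, shuffled))
```

Minimizing this loss raises the bound returned by `mi_lower_bound`, so adding it to a task loss is one way a regularizer could push the reasoning module to retain information shared by text and image features. The bound can never exceed log(N) for a batch of N pairs.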
Computer Science