Visual Description Augmented Integration Network for Multimodal Entity and Relation Extraction

Qingchuan Zhang,Yuanyuan Cai,Min Zuo,Jianlei Kong,Wei Dong,Yingjun Wang
DOI: https://doi.org/10.3390/app13106178
2023-05-18
Applied Sciences
Abstract:Multimodal Named Entity Recognition (MNER) and multimodal Relationship Extraction (MRE) play an important role in processing multimodal data and understanding entity relationships across textual and visual domains. However, irrelevant image information may introduce noise that misleads the recognition of information. Additionally, visual and semantic features originate from different modalities, and modal disparity hinders semantic alignment. Therefore, this paper proposes the Visual Description Augmentation Integration Network (VDAIN), which introduces an image description generation technique that allows semantic features generated from image descriptions to be presented in the same modality as the semantic features of textual information. This not only reduces the modal gap but also captures more accurately the high-level semantic information and underlying visual structure in the images. To filter out the modal noise, we use VDAIN to adaptively fuse visual features, semantic features of image descriptions, and textual information, thus eliminating irrelevant modal noise. The F1 score of the proposed model in this paper reaches 75.8% and 87.78% for the MNER task and 82.54% for the MRE task on the three public data sets, respectively, which are significantly better than the baseline model. The experimental results demonstrate the effectiveness of the proposed method in solving the modal noise and modal gap problems.
Computer Science
What problem does this paper attempt to address?