P-MNER: Cross Modal Correction Fusion Network with Prompt Learning for Multimodal Named Entity Recognition

Zhuang Wang, Yijia Zhang, Kang An, Xiaoying Zhou, Mingyu Lu, Hongfei Lin
DOI: https://doi.org/10.1007/978-981-99-6207-5_13
2023-01-01
Abstract: Multimodal Named Entity Recognition (MNER) is a challenging task in social media because it must combine text and image features. Previous MNER work has focused on predicting entity information after fusing visual and text features. However, pre-trained language models have already acquired vast amounts of knowledge during pre-training. To leverage this knowledge, we propose a prompt network for MNER tasks (P-MNER). To minimize the noise generated by irrelevant areas in the image, we design a visual feature extraction model (FRR) based on FasterRCNN and ResNet, which uses fine-grained visual features to assist MNER tasks. Moreover, we introduce a text correction fusion module (TCFM) into the model to address visual bias during modal fusion. Following the idea of a residual network, we use the original text features to correct the fused features. Our experiments on two benchmark datasets demonstrate that our proposed model outperforms existing MNER methods. P-MNER's ability to leverage pre-training knowledge from language models, incorporate fine-grained visual features, and correct for visual bias makes it a promising approach for multimodal named entity recognition in social media posts.
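The residual idea behind the text correction step can be illustrated with a minimal sketch. The abstract does not give the actual TCFM formulation, so everything here is an assumption made for illustration: the function name `residual_text_correction`, the weight `alpha`, and the representation of features as plain lists of per-token vectors are all hypothetical. The sketch only shows a skip connection that adds the original text features back onto the fused features.

```python
def residual_text_correction(fused, text, alpha=1.0):
    """Add the original text features back onto the fused features.

    fused, text: lists of per-token feature vectors with identical shapes.
    alpha: hypothetical weight on the text branch (not taken from the paper).
    """
    if len(fused) != len(text):
        raise ValueError("fused and text must have the same number of tokens")
    corrected = []
    for fused_row, text_row in zip(fused, text):
        if len(fused_row) != len(text_row):
            raise ValueError("feature dimensions must match for every token")
        # Skip connection: fused feature plus weighted original text feature.
        corrected.append([f + alpha * t for f, t in zip(fused_row, text_row)])
    return corrected


if __name__ == "__main__":
    fused = [[1.0, 2.0], [0.0, -1.0]]   # toy fused text-image features
    text = [[0.5, 0.5], [1.0, 1.0]]     # toy original text features
    print(residual_text_correction(fused, text))
```

With `alpha=0.0` the function returns the fused features unchanged, which makes the contribution of the text branch easy to isolate in this toy setting.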