End-to-End Visual Grounding Framework for Multimodal NER in Social Media Posts

Yifan Lyu, Jiapei Hu, Yun Xue, Qianhua Cai
DOI: https://doi.org/10.1109/tcss.2024.3402738
2024-01-01
IEEE Transactions on Computational Social Systems
Abstract: Multimodal named entity recognition (MNER) for social media aims to detect named entities in user-generated posts with the aid of visual information from attached images. Existing methods use pretrained visual models or visual grounding (VG) toolkits to learn visual information. However, they still suffer from a mismatch issue, where the visual features extracted by the visual encoder are inconsistent with the actual requirements of cross-modal interaction. In an ideal scenario, the visual encoder should actively extract visual information guided by the text, which inherently provides the blueprint of the desired visual features. In this article, we present an end-to-end VG framework for the MNER task (VG-MNER), which adaptively learns text-related visual features. Specifically, we introduce a backbone network with a feature fusion module to learn and aggregate multisize visual representations. We then develop a text-related visual attention mechanism to refine the visual features. Notably, an entity-image contrast loss is designed to guide the training of the visual encoder. The proposed model outperforms several state-of-the-art methods, achieving F1 scores of 75.62% and 88.11% on two benchmark datasets. Experimental results demonstrate the effectiveness of leveraging text-related visual information in the MNER task.
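The abstract names two components without giving their formulas: a text-related visual attention and an entity-image contrast loss. The sketch below illustrates the general idea only. It assumes scaled dot-product attention and an InfoNCE-style contrastive objective, both standard stand-ins rather than the authors' actual formulation. The function names, the temperature value, and the use of plain Python lists as feature vectors are all hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    # Scale a vector to unit length (assumes a non-zero vector).
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def text_guided_attention(text_vec, region_vecs):
    """Weight image regions by scaled dot-product similarity to one text vector.

    Assumption: stands in for the paper's text-related visual attention.
    Returns the attended visual vector and the attention weights.
    """
    scale = math.sqrt(len(text_vec))
    weights = softmax([dot(text_vec, r) / scale for r in region_vecs])
    dim = len(region_vecs[0])
    attended = [sum(w * r[i] for w, r in zip(weights, region_vecs))
                for i in range(dim)]
    return attended, weights

def contrastive_loss(entity_vecs, image_vecs, temperature=0.07):
    """InfoNCE-style loss where entity_vecs[i] should match image_vecs[i].

    Assumption: stands in for the paper's entity-image contrast loss;
    the temperature of 0.07 is an illustrative default.
    """
    ents = [normalize(e) for e in entity_vecs]
    imgs = [normalize(v) for v in image_vecs]
    losses = []
    for i, e in enumerate(ents):
        probs = softmax([dot(e, v) / temperature for v in imgs])
        losses.append(-math.log(probs[i]))
    return sum(losses) / len(losses)

# Toy usage: two image regions, one text vector aligned with the first region.
regions = [[1.0, 0.0], [0.0, 1.0]]
attended, weights = text_guided_attention([2.0, 0.0], regions)

# Matched entity-image pairs should give a lower loss than mismatched ones.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setting the attention weights favor the region most similar to the text vector, and the loss is smaller when each entity embedding is paired with its own image embedding. A loss of this kind gives the visual encoder a training signal tied to the text, which is the intuition the abstract describes.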