LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition

Jinyuan Li, Han Li, Di Sun, Jiahao Wang, Wenkun Zhang, Zan Wang, Gang Pan
2024-05-30
Abstract: Grounded Multimodal Named Entity Recognition (GMNER) is a nascent multimodal task that aims to identify named entities, entity types and their corresponding visual regions. The GMNER task exhibits two challenging properties: 1) The weak correlation between image-text pairs in social media results in a significant portion of named entities being ungroundable. 2) There exists a distinction between coarse-grained referring expressions commonly used in similar tasks (e.g., phrase localization, referring expression comprehension) and fine-grained named entities. In this paper, we propose RiVEG, a unified framework that reformulates GMNER into a joint MNER-VE-VG task by leveraging large language models (LLMs) as a connecting bridge. This reformulation brings two benefits: 1) It maintains the optimal MNER performance and eliminates the need for employing object detection methods to pre-extract regional features, thereby naturally addressing two major limitations of existing GMNER methods. 2) The introduction of entity expansion expression and Visual Entailment (VE) module unifies Visual Grounding (VG) and Entity Grounding (EG). It enables RiVEG to effortlessly inherit the Visual Entailment and Visual Grounding capabilities of any current or prospective multimodal pretraining models. Extensive experiments demonstrate that RiVEG outperforms state-of-the-art methods on the existing GMNER dataset and achieves absolute leads of 10.65%, 6.21%, and 8.83% in all three subtasks.
Computer Vision and Pattern Recognition, Computation and Language
What problem does this paper attempt to address?
This paper addresses two main challenges that arise when performing **Grounded Multimodal Named Entity Recognition (GMNER)** on social media image-text pairs:

1. **Weak correlation**: Image-text pairs in social media are often only weakly correlated, so a large share of named entities have no corresponding visual region in the image (they are "ungroundable"). Traditional Visual Grounding (VG) methods handle this poorly because they assume that every input text expression matches an object in the image, whereas many named entities in GMNER are not tied to any specific image region.
2. **Coarse-grained versus fine-grained expressions**: Existing VG methods usually work with coarse-grained referring expressions (such as noun phrases), while GMNER requires identifying fine-grained named entities. Because of this gap, existing VG methods are hard to apply directly to GMNER: they may fail to understand or correctly locate fine-grained named entities.

To overcome these challenges, the paper proposes a new framework, **RiVEG**, which reformulates the GMNER task in the following ways:

- **Introducing Large Language Models (LLMs) as bridges**: Auxiliary knowledge generated by LLMs is used to improve Multimodal Named Entity Recognition (MNER), and LLMs convert fine-grained named entities into coarse-grained referring expressions that VG methods can process.
- **Designing a Visual Entailment (VE) module**: This module handles the weak correlation between image-text pairs by determining whether a named entity corresponds to any region in the image.
- **Integrating MNER, VE and VG**: The GMNER task is decomposed into three stages (named entity recognition, visual entailment judgment, and visual grounding) so that the strengths of existing methods for each stage can be reused and overall performance improves.

With this approach, RiVEG performs significantly better than existing methods on the existing GMNER dataset and also reaches a new state-of-the-art level on the MNER subtask.
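The three-stage decomposition can be illustrated with a small control-flow sketch. This is only a minimal illustration under stated assumptions, not the authors' implementation: the names `run_gmner`, `Entity`, `GroundedEntity`, and all `toy_*` functions are hypothetical, the "image" is a plain dictionary standing in for real visual input, and the entity labels and example sentence are made up. In a real system each stage would be a trained model (an MNER tagger, an LLM for the expansion expression, and pretrained VE and VG models).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # assumed (x1, y1, x2, y2) pixel format


@dataclass
class Entity:
    text: str   # entity span from the sentence
    etype: str  # entity type label (illustrative labels only)


@dataclass
class GroundedEntity:
    text: str
    etype: str
    box: Optional[Box]  # None means the entity was judged ungroundable


def run_gmner(
    sentence: str,
    image: object,
    mner: Callable[[str, object], List[Entity]],
    expand: Callable[[str, Entity], str],
    entails: Callable[[object, str], bool],
    ground: Callable[[object, str], Box],
) -> List[GroundedEntity]:
    """Chain the three stages: MNER, then VE, then VG."""
    results = []
    for entity in mner(sentence, image):       # stage 1: find entities and types
        expression = expand(sentence, entity)  # bridge: entity -> referring expression
        if entails(image, expression):         # stage 2: is the entity in the image?
            box = ground(image, expression)    # stage 3: localize the expression
        else:
            box = None                         # weakly correlated pair: no region
        results.append(GroundedEntity(entity.text, entity.etype, box))
    return results


# Toy stand-ins for the real models (all hypothetical).
# The "image" maps visible descriptions to boxes instead of holding pixels.
TOY_IMAGE: Dict[str, Box] = {"a basketball player in a red jersey": (40, 10, 200, 380)}


def toy_mner(sentence: str, image: object) -> List[Entity]:
    return [Entity("Michael Jordan", "PER"), Entity("Chicago", "LOC")]


def toy_expand(sentence: str, entity: Entity) -> str:
    # A real system would ask an LLM for this coarse-grained expression.
    table = {
        "Michael Jordan": "a basketball player in a red jersey",
        "Chicago": "a city",
    }
    return table[entity.text]


def toy_entails(image: Dict[str, Box], expression: str) -> bool:
    return expression in image


def toy_ground(image: Dict[str, Box], expression: str) -> Box:
    return image[expression]


results = run_gmner(
    "Michael Jordan returns to Chicago",
    TOY_IMAGE, toy_mner, toy_expand, toy_entails, toy_ground,
)
for item in results:
    print(item)
```

In this toy run the person entity receives a box, while the location entity is filtered out by the entailment check and keeps `box=None`. Passing the stages in as callables mirrors the summary's point that any VE or VG model can be plugged in without changing the rest of the pipeline.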