Advancing Grounded Multimodal Named Entity Recognition via LLM-Based Reformulation and Box-Based Segmentation

Jinyuan Li, Ziyan Li, Han Li, Jianfei Yu, Rui Xia, Di Sun, Gang Pan
2024-06-11
Abstract: Grounded Multimodal Named Entity Recognition (GMNER) task aims to identify named entities, entity types and their corresponding visual regions. GMNER task exhibits two challenging attributes: 1) The tenuous correlation between images and text on social media contributes to a notable proportion of named entities being ungroundable. 2) There exists a distinction between coarse-grained noun phrases used in similar tasks (e.g., phrase localization) and fine-grained named entities. In this paper, we propose RiVEG, a unified framework that reformulates GMNER into a joint MNER-VE-VG task by leveraging large language models (LLMs) as connecting bridges. This reformulation brings two benefits: 1) It enables us to optimize the MNER module for optimal MNER performance and eliminates the need to pre-extract region features using object detection methods, thus naturally addressing the two major limitations of existing GMNER methods. 2) The introduction of Entity Expansion Expression module and Visual Entailment (VE) module unifies Visual Grounding (VG) and Entity Grounding (EG). This endows the proposed framework with unlimited data and model scalability. Furthermore, to address the potential ambiguity stemming from the coarse-grained bounding box output in GMNER, we further construct the new Segmented Multimodal Named Entity Recognition (SMNER) task and corresponding Twitter-SMNER dataset aimed at generating fine-grained segmentation masks, and experimentally demonstrate the feasibility and effectiveness of using box prompt-based Segment Anything Model (SAM) to empower any GMNER model with the ability to accomplish the SMNER task. Extensive experiments demonstrate that RiVEG significantly outperforms SoTA methods on four datasets across the MNER, GMNER, and SMNER tasks.
Multimedia, Computation and Language, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses two main challenges in the Grounded Multimodal Named Entity Recognition (GMNER) task:

1. **Weak association between text and image**: On social media, text and images are often only loosely correlated, so a notable proportion of named entities cannot be grounded in the image.
2. **Gap between coarse-grained noun phrases and fine-grained named entities**: Similar tasks such as phrase localization operate on coarse-grained noun phrases, whereas GMNER requires grounding fine-grained named entities.

To address these challenges, the authors propose RiVEG, a unified framework that improves GMNER in the following ways:

- **Task reformulation**: GMNER is reformulated as a joint MNER (Multimodal Named Entity Recognition), VE (Visual Entailment) and VG (Visual Grounding) task, with large language models (LLMs) serving as connecting bridges. This allows the MNER module to be optimized for MNER performance on its own and removes the need to pre-extract region features with object detection methods, which addresses the two major limitations of existing GMNER methods.
- **Entity Expansion Expression module and Visual Entailment module**: These two modules unify Visual Grounding (VG) and Entity Grounding (EG), which, according to the authors, gives the framework unlimited data and model scalability.
- **Segmented Multimodal Named Entity Recognition (SMNER)**: To address the potential ambiguity of the coarse-grained bounding box output in GMNER, the authors construct a new task, SMNER, and the corresponding Twitter-SMNER dataset, which targets fine-grained segmentation masks. Experiments demonstrate that a box prompt-based Segment Anything Model (SAM) can equip any GMNER model to perform the SMNER task.
Through these contributions, RiVEG significantly outperforms existing state-of-the-art methods on four datasets across the MNER, GMNER and SMNER tasks.
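
The MNER-VE-VG reformulation, followed by box-prompted segmentation, can be pictured as a simple staged pipeline. The sketch below is only an illustration of that control flow under stated assumptions, not the authors' implementation: every name in it (`riveg_style_pipeline`, `Entity`, `GroundedEntity`, and the `toy_*` stand-ins) is hypothetical, and each stage is a placeholder callable where a real system would plug in an MNER model, an LLM, a VE model, a VG model, and a box-prompted segmenter such as SAM.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class Entity:
    text: str
    type: str


@dataclass
class GroundedEntity:
    entity: Entity
    box: Optional[Box]          # None marks an ungroundable entity
    mask: Optional[Any] = None  # filled only when a segmenter is supplied


def riveg_style_pipeline(
    sentence: str,
    image: Any,
    mner: Callable[[str, Any], List[Entity]],
    expand: Callable[[str, Entity], str],
    entails: Callable[[Any, str], bool],
    ground: Callable[[Any, str], Box],
    segment: Optional[Callable[[Any, Box], Any]] = None,
) -> List[GroundedEntity]:
    """Run the stages in order: MNER -> expansion -> VE -> VG -> optional mask."""
    results = []
    for ent in mner(sentence, image):
        # Stage 2: turn the fine-grained entity into a descriptive expression
        # (the role the summary assigns to the LLM bridge).
        expression = expand(sentence, ent)
        # Stage 3: visual entailment decides whether the entity is groundable.
        if not entails(image, expression):
            results.append(GroundedEntity(ent, None))
            continue
        # Stage 4: visual grounding predicts a box for the expression.
        box = ground(image, expression)
        # Stage 5 (SMNER): refine the box into a mask with a box-prompted segmenter.
        mask = segment(image, box) if segment is not None else None
        results.append(GroundedEntity(ent, box, mask))
    return results


# Toy stand-ins so the sketch runs end to end; they are not real models.
def toy_mner(sentence, image):
    return [Entity("Kevin Durant", "PER"), Entity("Oklahoma", "LOC")]


def toy_expand(sentence, ent):
    return f"{ent.text}, a {ent.type} entity mentioned in: {sentence}"


def toy_entails(image, expression):
    # Pretend only the person is visible in the image.
    return expression.startswith("Kevin Durant")


def toy_ground(image, expression):
    return (10.0, 20.0, 110.0, 220.0)


def toy_segment(image, box):
    return {"from_box": box}


for item in riveg_style_pipeline(
    "Kevin Durant returns to Oklahoma", None,
    toy_mner, toy_expand, toy_entails, toy_ground, toy_segment,
):
    print(item.entity.text, item.entity.type, item.box, item.mask)
```

Keeping each stage behind a plain callable mirrors the scalability argument in the summary: any stage can be replaced by a stronger model without touching the others, and the segmentation step can be attached to any pipeline that already outputs boxes.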