Abstract: Grounded Multimodal Named Entity Recognition (GMNER) is a nascent multimodal task that aims to identify named entities, entity types and their corresponding visual regions. The GMNER task exhibits two challenging properties: 1) The weak correlation between image-text pairs in social media results in a significant portion of named entities being ungroundable. 2) There exists a distinction between coarse-grained referring expressions commonly used in similar tasks (e.g., phrase localization, referring expression comprehension) and fine-grained named entities. In this paper, we propose RiVEG, a unified framework that reformulates GMNER into a joint MNER-VE-VG task by leveraging large language models (LLMs) as a connecting bridge. This reformulation brings two benefits: 1) It maintains the optimal MNER performance and eliminates the need for employing object detection methods to pre-extract regional features, thereby naturally addressing two major limitations of existing GMNER methods. 2) The introduction of entity expansion expression and Visual Entailment (VE) module unifies Visual Grounding (VG) and Entity Grounding (EG). It enables RiVEG to effortlessly inherit the Visual Entailment and Visual Grounding capabilities of any current or prospective multimodal pretraining models. Extensive experiments demonstrate that RiVEG outperforms state-of-the-art methods on the existing GMNER dataset and achieves absolute leads of 10.65%, 6.21%, and 8.83% in all three subtasks.

What problem does this paper attempt to address?

This paper addresses two main challenges in **Grounded Multimodal Named Entity Recognition (GMNER)** on social media image-text pairs:

1. **Weak Correlation**: Image-text pairs in social media are often only weakly correlated, so a large number of named entities have no corresponding visual region in the image (i.e., they are "ungroundable"). Traditional Visual Grounding (VG) methods assume that the input text expression matches an object in the image, so they are ineffective on GMNER, where many named entities are not associated with any specific image region.
2. **Difference between Coarse-grained and Fine-grained Expressions**: Existing VG methods usually work with coarse-grained referring expressions (such as noun phrases), while GMNER needs to ground fine-grained named entities. VG methods may fail to understand or correctly locate these named entities, which makes them difficult to apply to GMNER directly.

To overcome these challenges, the paper proposes a new framework, **RiVEG**, which reformulates the GMNER task in the following ways:

- **Introducing Large Language Models (LLMs) as Bridges**: Use auxiliary knowledge generated by LLMs to improve Multimodal Named Entity Recognition (MNER), and convert fine-grained named entities into coarse-grained referring expressions that VG methods can process.
- **Designing a Visual Entailment (VE) Module**: Handle the weak correlation between image-text pairs by determining whether a named entity corresponds to any region in the image.
- **Integrating MNER, VE and VG**: Decompose the GMNER task into three stages (named entity recognition, visual entailment judgment, and visual grounding) so that the strengths of existing methods can be reused and overall performance improves.

With this reformulation, RiVEG achieves significantly better performance than existing methods on the existing GMNER dataset and also sets a new state of the art on the MNER subtask.
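
The three-stage decomposition described above can be sketched as a small pipeline. This is only an illustrative sketch of the control flow, not the authors' implementation: the names `GroundedEntity`, `run_pipeline`, `mner`, `expand`, `entails`, and `ground` are hypothetical, the box format is an assumed `(x1, y1, x2, y2)` convention, and each stage is passed in as a plain callable standing in for whatever MNER, LLM, VE, or VG model is actually used.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple

# Assumed box convention (not taken from the paper): (x1, y1, x2, y2).
Box = Tuple[float, float, float, float]


@dataclass
class GroundedEntity:
    """One GMNER prediction: entity span, entity type, and visual region."""
    span: str
    entity_type: str
    box: Optional[Box]  # None marks the entity as ungroundable


def run_pipeline(
    text: str,
    image: Any,
    mner: Callable[[str, Any], List[Tuple[str, str]]],
    expand: Callable[[str, str, str], str],
    entails: Callable[[Any, str], bool],
    ground: Callable[[Any, str], Box],
) -> List[GroundedEntity]:
    """Hypothetical MNER -> VE -> VG control flow.

    mner    : returns (span, type) pairs for the image-text pair
    expand  : turns a fine-grained entity into a coarse-grained
              referring expression (the role the summary assigns to the LLM)
    entails : VE stage; True if the expression is visible in the image
    ground  : VG stage; returns a box for a groundable expression
    """
    results: List[GroundedEntity] = []
    for span, entity_type in mner(text, image):
        expression = expand(span, entity_type, text)
        if entails(image, expression):
            # Only entities that pass the VE check are sent to grounding.
            box: Optional[Box] = ground(image, expression)
        else:
            # Weakly correlated image-text pair: keep the entity, no region.
            box = None
        results.append(GroundedEntity(span, entity_type, box))
    return results
```

Passing the stages as callables reflects the point made in the abstract that the framework can reuse existing VE and VG models: under this sketch, swapping a model means swapping one argument while the entity-level loop stays the same.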

LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition

Advancing Grounded Multimodal Named Entity Recognition via LLM-Based Reformulation and Box-Based Segmentation

End-to-End Visual Grounding Framework for Multimodal NER in Social Media Posts

Granular Entity Mapper: Advancing Fine-grained Multimodal Named Entity Recognition and Grounding

Fine-Grained Multimodal Named Entity Recognition and Grounding with a Generative Framework

Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering

MNER-QG: An End-to-End MRC Framework for Multimodal Named Entity Recognition with Query Grounding

Grounded Multimodal Named Entity Recognition on Social Media

Multimodal heterogeneous graph entity-level fusion for named entity recognition with multi-granularity visual guidance

Multi-Grained Query-Guided Set Prediction Network for Grounded Multimodal Named Entity Recognition

Joint Multimodal Entity-Relation Extraction Based on Edge-enhanced Graph Alignment Network and Word-pair Relation Tagging

A Fine-Grained Network for Joint Multimodal Entity-Relation Extraction

Visual Clue Guidance and Consistency Matching Framework for Multimodal Named Entity Recognition

Query Prior Matters: A MRC Framework for Multimodal Named Entity Recognition

Exploiting Visual Relation and Multi-Grained Knowledge for Multimodal Relation Extraction

Contrastive Pre-training with Multi-level Alignment for Grounded Multimodal Named Entity Recognition

MVPN: Multi-granularity visual prompt-guided fusion network for multimodal named entity recognition

Generalizable Entity Grounding via Assistance of Large Language Model

Learning Implicit Entity-object Relations by Bidirectional Generative Alignment for Multimodal NER