Abstract: The Grounded Multimodal Named Entity Recognition (GMNER) task aims to identify named entities, their entity types, and their corresponding visual regions. The GMNER task exhibits two challenging attributes: 1) The tenuous correlation between images and text on social media means that a notable proportion of named entities are ungroundable. 2) There is a distinction between the coarse-grained noun phrases used in similar tasks (e.g., phrase localization) and fine-grained named entities. In this paper, we propose RiVEG, a unified framework that reformulates GMNER into a joint MNER-VE-VG task by leveraging large language models (LLMs) as connecting bridges. This reformulation brings two benefits: 1) It enables us to tune the MNER module for the best MNER performance and eliminates the need to pre-extract region features using object detection methods, thus naturally addressing the two major limitations of existing GMNER methods. 2) The introduction of the Entity Expansion Expression module and the Visual Entailment (VE) module unifies Visual Grounding (VG) and Entity Grounding (EG). This endows the proposed framework with unlimited data and model scalability. Furthermore, to address the potential ambiguity stemming from the coarse-grained bounding box output in GMNER, we construct the new Segmented Multimodal Named Entity Recognition (SMNER) task and the corresponding Twitter-SMNER dataset, aimed at generating fine-grained segmentation masks, and experimentally demonstrate the feasibility and effectiveness of using the box prompt-based Segment Anything Model (SAM) to empower any GMNER model to accomplish the SMNER task. Extensive experiments demonstrate that RiVEG significantly outperforms SoTA methods on four datasets across the MNER, GMNER, and SMNER tasks.

What problem does this paper attempt to address?

This paper aims to address two main challenges in the Grounded Multimodal Named Entity Recognition (GMNER) task:

1. **Weak Association between Text and Image**: On social media, text and images are often only weakly correlated, so a notable proportion of named entities cannot be grounded in the image at all.
2. **Differences between Coarse-grained Noun Phrases and Fine-grained Named Entities**: Similar tasks such as phrase localization operate on coarse-grained noun phrases, whereas GMNER requires grounding fine-grained named entities.

To solve these problems, the authors propose a unified framework, RiVEG, which improves the GMNER task in the following ways:

- **Task Reformulation**: GMNER is reformulated as a joint MNER (Multimodal Named Entity Recognition), VE (Visual Entailment), and VG (Visual Grounding) task, with large language models (LLMs) serving as connecting bridges. This lets the MNER module be optimized for MNER performance on its own and removes the need to pre-extract region features with object detection methods, thus naturally addressing the two major limitations of existing GMNER methods.
- **Entity Expansion Expression Module and Visual Entailment Module**: These two modules unify Visual Grounding (VG) and Entity Grounding (EG), which, according to the authors, endows the framework with unlimited data and model scalability.
- **Segmentation Task (SMNER)**: To address the potential ambiguity of coarse-grained bounding box output in GMNER, the authors propose a new task, Segmented Multimodal Named Entity Recognition (SMNER), and construct the corresponding Twitter-SMNER dataset, aimed at generating fine-grained segmentation masks. Experimental results show that the box prompt-based Segment Anything Model (SAM) can empower any GMNER model to accomplish the SMNER task.

Through these innovations, RiVEG significantly outperforms existing state-of-the-art methods on four datasets across the MNER, GMNER, and SMNER tasks.
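The staged reformulation summarized above can be sketched as a small pipeline. This is only an illustrative sketch, not the authors' implementation: the names `Entity`, `GroundedEntity`, and `ground_entities`, along with the callables `mner`, `expand`, `entails`, `locate`, and `segment`, are all hypothetical placeholders for whatever MNER model, LLM-based expansion, VE model, VG model, and box-prompted segmenter one plugs in.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Assumed box convention for this sketch: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


@dataclass
class Entity:
    span: str   # entity text as it appears in the post
    type: str   # entity type label, e.g. "PER"


@dataclass
class GroundedEntity:
    span: str
    type: str
    box: Optional[Box]             # None means "judged ungroundable"
    mask: Optional[object] = None  # filled only when a segmenter is supplied


def ground_entities(
    text: str,
    image: object,
    mner: Callable[[str, object], List[Entity]],   # stage 1: MNER
    expand: Callable[[str, Entity], str],          # LLM bridge: entity -> expression
    entails: Callable[[object, str], bool],        # stage 2: visual entailment
    locate: Callable[[object, str], Box],          # stage 3: visual grounding
    segment: Optional[Callable[[object, Box], object]] = None,  # optional box-prompted segmenter
) -> List[GroundedEntity]:
    """Run the hypothetical MNER -> VE -> VG chain on one text-image pair."""
    results: List[GroundedEntity] = []
    for ent in mner(text, image):
        # Turn the fine-grained entity into a descriptive expression.
        expression = expand(text, ent)
        # Only entities the VE stage accepts are passed on to grounding.
        box = locate(image, expression) if entails(image, expression) else None
        # A predicted box can serve as the prompt for a mask (the SMNER setting).
        mask = segment(image, box) if (segment is not None and box is not None) else None
        results.append(GroundedEntity(ent.span, ent.type, box, mask))
    return results
```

Because each stage is passed in as a plain callable, any stage can be swapped independently, which mirrors the modularity the summary attributes to the framework. Entities rejected by the entailment check keep `box=None` rather than being dropped, so they still count as recognized entities.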

Advancing Grounded Multimodal Named Entity Recognition via LLM-Based Reformulation and Box-Based Segmentation
