Abstract: The Grounded Multimodal Named Entity Recognition (GMNER) task aims to identify named entities, their entity types, and their corresponding visual regions. The task poses two challenges: 1) The weak correlation between images and text on social media leaves a notable proportion of named entities ungroundable. 2) The coarse-grained noun phrases used in similar tasks (e.g., phrase localization) differ from fine-grained named entities. In this paper, we propose RiVEG, a unified framework that reformulates GMNER into a joint MNER-VE-VG task by using large language models (LLMs) as connecting bridges. This reformulation brings two benefits: 1) It lets us optimize the MNER module for the best MNER performance and removes the need to pre-extract region features with object detection methods, thus naturally addressing the two major limitations of existing GMNER methods. 2) The introduction of the Entity Expansion Expression module and the Visual Entailment (VE) module unifies Visual Grounding (VG) and Entity Grounding (EG), which endows the proposed framework with unlimited data and model scalability. Furthermore, to address the potential ambiguity stemming from the coarse-grained bounding box output in GMNER, we construct the new Segmented Multimodal Named Entity Recognition (SMNER) task and the corresponding Twitter-SMNER dataset, which target the generation of fine-grained segmentation masks, and we experimentally demonstrate the feasibility and effectiveness of using the box prompt-based Segment Anything Model (SAM) to enable any GMNER model to accomplish the SMNER task. Extensive experiments demonstrate that RiVEG significantly outperforms SoTA methods on four datasets across the MNER, GMNER, and SMNER tasks.
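
The MNER-VE-VG decomposition described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function `run_pipeline`, the `GroundedEntity` record, the `(x1, y1, x2, y2)` box convention, and all four component callables are hypothetical stand-ins chosen for illustration. The sketch only shows the control flow implied by the abstract: recognize entities, expand each into a referring expression, use visual entailment to decide whether the entity is groundable, and run visual grounding only when it is.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Assumed box convention (x1, y1, x2, y2); the abstract does not specify one.
Box = Tuple[float, float, float, float]


@dataclass
class GroundedEntity:
    text: str
    entity_type: str
    box: Optional[Box]  # None marks an entity judged ungroundable


def run_pipeline(
    sentence: str,
    image: object,
    mner: Callable[[str, object], List[Tuple[str, str]]],
    expand: Callable[[str, str, str], str],
    entails: Callable[[object, str], bool],
    ground: Callable[[object, str], Box],
) -> List[GroundedEntity]:
    """Chain hypothetical MNER, expansion, VE, and VG components."""
    results = []
    for span, entity_type in mner(sentence, image):
        # Stand-in for the LLM step: turn a named entity into a
        # referring expression that a VE or VG model can consume.
        expression = expand(sentence, span, entity_type)
        # VE acts as a filter: only entailed expressions are grounded.
        box = ground(image, expression) if entails(image, expression) else None
        results.append(GroundedEntity(span, entity_type, box))
    return results


# Toy components, for demonstration only. Real modules would be trained models.
def toy_mner(sentence, image):
    return [("Messi", "PER"), ("Paris", "LOC")]


def toy_expand(sentence, span, entity_type):
    return f"{span}, a {entity_type} entity"


def toy_entails(image, expression):
    return expression.startswith("Messi")


def toy_ground(image, expression):
    return (10.0, 20.0, 110.0, 220.0)


out = run_pipeline(
    "Messi arrives in Paris", None,
    toy_mner, toy_expand, toy_entails, toy_ground,
)
for entity in out:
    print(entity)
```

Under this sketch, extending the output to the SMNER setting would amount to passing each non-empty box to a box-prompted segmentation model; that step is omitted here because it depends on an external model.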

Contrastive Pre-training with Multi-level Alignment for Grounded Multimodal Named Entity Recognition

Multimodal Named Entity Recognition with Bottleneck Fusion and Contrastive Learning

MGICL: Multi-Grained Interaction Contrastive Learning for Multimodal Named Entity Recognition

Multi-Grained Query-Guided Set Prediction Network for Grounded Multimodal Named Entity Recognition

Visual Clue Guidance and Consistency Matching Framework for Multimodal Named Entity Recognition

Pretraining Multi-modal Representations for Chinese NER Task with Cross-Modality Attention

Multimodal heterogeneous graph entity-level fusion for named entity recognition with multi-granularity visual guidance

GNN-Based Multimodal Named Entity Recognition

Learning Implicit Entity-object Relations by Bidirectional Generative Alignment for Multimodal NER

End-to-End Visual Grounding Framework for Multimodal NER in Social Media Posts

Advancing Grounded Multimodal Named Entity Recognition via LLM-Based Reformulation and Box-Based Segmentation

Grounded Multimodal Named Entity Recognition on Social Media

A Multi-Task Framework Based on Decomposition for Multimodal Named Entity Recognition

Multi-modal Graph Fusion for Named Entity Recognition with Targeted Visual Guidance

2M-NER: Contrastive Learning for Multilingual and Multimodal NER with Language and Modal Fusion

LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition

P-MNER: Cross Modal Correction Fusion Network with Prompt Learning for Multimodal Named Entity Recognition

MAFN: Multi-Level Attention Fusion Network for Multimodal Named Entity Recognition

CAT-MNER: Multimodal Named Entity Recognition with Knowledge-Refined Cross-Modal Attention

MNER-QG: An End-to-End MRC Framework for Multimodal Named Entity Recognition with Query Grounding

MPMRC-MNER: A Unified MRC Framework for Multimodal Named Entity Recognition Based Multimodal Prompt