Abstract: Grounded Multimodal Named Entity Recognition (GMNER) is an emerging information extraction (IE) task, aiming to simultaneously extract entity spans, types, and corresponding visual regions of entities from given sentence-image pairs data. Recent unified methods employing machine reading comprehension or sequence generation-based frameworks show limitations in this difficult task. The former, utilizing human-designed queries, struggles to differentiate ambiguous entities, such as Jordan (Person) and off-White x Jordan (Shoes). The latter, following the one-by-one decoding order, suffers from exposure bias issues. We maintain that these works misunderstand the relationships of multimodal entities. To tackle these, we propose a novel unified framework named Multi-grained Query-guided Set Prediction Network (MQSPN) to learn appropriate relationships at intra-entity and inter-entity levels. Specifically, MQSPN consists of a Multi-grained Query Set (MQS) and a Multimodal Set Prediction Network (MSP). MQS explicitly aligns entity regions with entity spans by employing a set of learnable queries to strengthen intra-entity connections. Based on distinct intra-entity modeling, MSP reformulates GMNER as a set prediction, guiding models to establish appropriate inter-entity relationships from a global matching perspective. Additionally, we incorporate a query-guided Fusion Net (QFNet) to work as a glue network between MQS and MSP. Extensive experiments demonstrate that our approach achieves state-of-the-art performances in widely used benchmarks.

What problem does this paper attempt to address?

The paper addresses the **Grounded Multimodal Named Entity Recognition (GMNER)** task, an emerging information extraction task: given text-image pairs, simultaneously extract each entity's text span, its type, and its corresponding visual region in the image.

### Limitations of Existing Methods

1. **Methods Based on Machine Reading Comprehension (MRC)**
   - These methods rely on manually designed queries to guide entity recognition and entity localization, but they perform poorly on ambiguous entities. For example, in Figure 1(a), when multiple fixed person-name queries are input, the model may wrongly recognize "off-White x Jordan" (shoes) as "Jordan" (person) and assign its wrong region to Kevin Durant.
2. **Methods Based on Sequence Generation**
   - These methods decode the text spans, types, and regions of entities one by one in a predefined order. Each prediction is therefore highly sensitive to the previous predictions, which causes the exposure bias problem. For example, in Figure 1(b), if the detection of Kevin Durant is wrong, the subsequent region prediction of "off-White x Jordan" is also affected.

### The Method Proposed in the Paper

To solve the above problems, the paper proposes a new unified framework, the **Multi-grained Query-guided Set Prediction Network (MQSPN)**. The framework consists of the following parts:

1. **Multi-grained Query Set (MQS)**
   - MQS explicitly aligns entity regions with entity text spans through a set of learnable queries, strengthening intra-entity connections. MQS contains two parts: the Type-grained Query and the Entity-grained Query. The Type-grained Query is generated by the BERT model, and the Entity-grained Query is randomly initialized and learned.
2. **Multimodal Set Prediction Network (MSP)**
   - MSP reformulates the GMNER task as a set prediction problem, guiding the model to establish appropriate inter-entity relationships from a global matching perspective. MSP predicts a set of multimodal entities in parallel in a non-autoregressive way, which avoids dependence on a predefined decoding order.
3. **Query-guided Fusion Net (QFNet)**
   - QFNet acts as the glue network between MQS and MSP. It filters out irrelevant visual features and improves the alignment of textual and visual features, using the queries as intermediaries to integrate the text representations and the visual region representations separately.

### Main Contributions

1. **In-depth Exploration of the Weaknesses of Existing Unified GMNER Methods**
   - The paper analyzes the deficiencies of existing methods at two levels, intra-entity and inter-entity relationships, and proposes the MQSPN framework to adaptively learn intra-entity relationships and establish inter-entity relationships from a globally optimal matching perspective.
2. **First Application of the Set Prediction Paradigm to the GMNER Task**
   - The paper proposes MSP, the first attempt to apply set prediction to the GMNER task.
3. **Experimental Verification**
   - Extensive experiments on two Twitter benchmark datasets show that the proposed method significantly outperforms the existing state-of-the-art methods. Ablation studies also verify the effectiveness of each designed module.

Through these innovations, MQSPN performs well on complex multimodal entity recognition tasks, especially fine-grained entity type recognition.
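The set prediction idea behind MSP can be illustrated with a small sketch. It assumes a hypothetical cost (span mismatch + type mismatch + 1 - region IoU) and finds the cheapest one-to-one assignment between predicted and gold entities by brute force. The names `iou`, `pair_cost`, and `match`, the cost terms, and the box coordinates are all illustrative assumptions, not the paper's actual loss or implementation; set prediction models commonly use the Hungarian algorithm for this step, and enumeration is swapped in here only to keep the example dependency-free.

```python
from itertools import permutations


def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def pair_cost(pred, gold):
    """Hypothetical cost: span mismatch + type mismatch + (1 - region IoU)."""
    return (
        float(pred["span"] != gold["span"])
        + float(pred["type"] != gold["type"])
        + (1.0 - iou(pred["box"], gold["box"]))
    )


def match(preds, golds):
    """Return (pairs, total_cost) for the cheapest one-to-one assignment.

    Assumes len(preds) >= len(golds); unmatched predictions are ignored here.
    Brute force over assignments, so only suitable for tiny sets.
    """
    best_total, best_perm = float("inf"), ()
    for perm in permutations(range(len(preds)), len(golds)):
        total = sum(pair_cost(preds[p], golds[g]) for g, p in enumerate(perm))
        if total < best_total:
            best_total, best_perm = total, perm
    # Each pair is (prediction index, gold index).
    return [(p, g) for g, p in enumerate(best_perm)], best_total


# Illustrative data only: entity names come from the text, boxes are made up.
golds = [
    {"span": "Kevin Durant", "type": "Person", "box": (10, 10, 110, 210)},
    {"span": "off-White x Jordan", "type": "Shoes", "box": (40, 180, 100, 220)},
]
preds = [  # three query slots, in arbitrary order
    {"span": "off-White x Jordan", "type": "Shoes", "box": (42, 180, 100, 220)},
    {"span": "Kevin Durant", "type": "Person", "box": (10, 10, 110, 210)},
    {"span": "Jordan", "type": "Person", "box": (10, 10, 110, 210)},
]

pairs, total = match(preds, golds)
print(pairs)  # [(1, 0), (0, 1)]
```

Because the assignment is chosen globally over the whole set, reordering `preds` changes the indices in `pairs` but not the total cost, which is the property that removes the dependence on a predefined decoding order.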

Multi-Grained Query-Guided Set Prediction Network for Grounded Multimodal Named Entity Recognition
