Abstract: Grounded Multimodal Named Entity Recognition (GMNER) is an emerging information extraction (IE) task, aiming to simultaneously extract entity spans, types, and corresponding visual regions of entities from given sentence-image pairs data. Recent unified methods employing machine reading comprehension or sequence generation-based frameworks show limitations in this difficult task. The former, utilizing human-designed queries, struggles to differentiate ambiguous entities, such as Jordan (Person) and off-White x Jordan (Shoes). The latter, following the one-by-one decoding order, suffers from exposure bias issues. We maintain that these works misunderstand the relationships of multimodal entities. To tackle these, we propose a novel unified framework named Multi-grained Query-guided Set Prediction Network (MQSPN) to learn appropriate relationships at intra-entity and inter-entity levels. Specifically, MQSPN consists of a Multi-grained Query Set (MQS) and a Multimodal Set Prediction Network (MSP). MQS explicitly aligns entity regions with entity spans by employing a set of learnable queries to strengthen intra-entity connections. Based on distinct intra-entity modeling, MSP reformulates GMNER as a set prediction, guiding models to establish appropriate inter-entity relationships from a global matching perspective. Additionally, we incorporate a query-guided Fusion Net (QFNet) to work as a glue network between MQS and MSP. Extensive experiments demonstrate that our approach achieves state-of-the-art performances in widely used benchmarks.
Information Retrieval,Artificial Intelligence,Computation and Language,Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses **Grounded Multimodal Named Entity Recognition (GMNER)**, an emerging information extraction task: given a text-image pair, jointly extract each entity's text span, its type, and the corresponding visual region in the image.
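As a rough illustration of what a GMNER system is expected to output, the sketch below represents each entity as a (span, type, region) triple. The class name `GroundedEntity`, the example sentence, the box coordinates, and the choice to allow a missing region are all assumptions made for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class GroundedEntity:
    """One GMNER prediction: text span, entity type, and visual region."""
    span: Tuple[int, int]  # inclusive token indices (start, end)
    entity_type: str
    # Hypothetical pixel box (x1, y1, x2, y2); None means "no region" (assumption).
    region: Optional[Tuple[float, float, float, float]]


def span_text(tokens: List[str], entity: GroundedEntity) -> str:
    """Return the surface text covered by the entity's span."""
    start, end = entity.span
    return " ".join(tokens[start:end + 1])


# Hypothetical sentence built around the entities named in the abstract.
tokens = ["Kevin", "Durant", "wears", "off-White", "x", "Jordan"]
entities = [
    GroundedEntity((0, 1), "Person", (40.0, 10.0, 210.0, 380.0)),
    GroundedEntity((3, 5), "Shoes", (90.0, 330.0, 180.0, 395.0)),
]

for entity in entities:
    print(span_text(tokens, entity), entity.entity_type, entity.region)
```

The three fields must be predicted together, which is why an error in one of them (for example, reading "Jordan" as a person) also tends to corrupt the grounding.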
### Limitations of Existing Methods
1. **Methods Based on Machine Reading Comprehension (MRC)**
- These methods rely on manually designed queries to guide entity recognition and entity localization, and they perform poorly on ambiguous entities. For example, in Figure 1(a), when multiple fixed person-name queries are input, the model may wrongly recognize "off-White x Jordan" (shoes) as "Jordan" (person) and ground it to the wrong region, that of Kevin Durant.
2. **Methods Based on Sequence Generation**
- These methods decode the text spans, types, and regions of entities one by one in a predefined order, so each prediction is highly sensitive to the previous ones, which causes the exposure bias problem. For example, in Figure 1(b), if the detection of Kevin Durant is wrong, the subsequent region prediction for "off-White x Jordan" is also affected.
### The Method Proposed in the Paper
To solve the above problems, the paper proposes a new unified framework, the **Multi-grained Query-guided Set Prediction Network (MQSPN)**. The framework consists of the following parts:
1. **Multi-grained Query Set (MQS)**
- MQS explicitly aligns entity regions with entity text spans through a set of learnable queries, strengthening intra-entity connections. MQS contains two parts: the Type-grained Query and the Entity-grained Query. The Type-grained Query is generated by the BERT model, while the Entity-grained Query is randomly initialized and learned.
2. **Multimodal Set Prediction Network (MSP)**
- MSP reformulates the GMNER task as a set prediction problem, guiding the model to establish appropriate inter-entity relationships from a global matching perspective. MSP predicts a set of multimodal entities in parallel in a non-autoregressive manner, so it does not depend on a predefined decoding order.
3. **Query-guided Fusion Net (QFNet)**
- QFNet acts as the glue network between MQS and MSP. It filters out irrelevant visual features and improves the alignment between textual and visual features, using the queries as intermediaries to integrate the text and visual region representations separately.
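The set prediction idea behind MSP can be sketched with a toy matcher. The summary only says that entities are predicted in parallel and matched from a global perspective; the sketch below assumes a DETR-style one-to-one assignment between prediction slots and gold entities, uses a made-up cost (type mismatch plus span boundary distance), and searches exhaustively rather than with an efficient solver. The function names and all numbers are hypothetical.

```python
from itertools import permutations


def match_cost(pred, gold):
    """Hypothetical matching cost between one prediction and one gold entity (lower is better)."""
    type_cost = 1.0 - pred["type_probs"].get(gold["type"], 0.0)
    span_cost = (abs(pred["span"][0] - gold["span"][0])
                 + abs(pred["span"][1] - gold["span"][1]))
    return type_cost + span_cost


def best_assignment(preds, golds):
    """Exhaustively find the one-to-one assignment with minimum total cost.

    Returns (assignment, cost), where assignment[g] is the index of the
    prediction slot matched to gold entity g. Assumes len(preds) >= len(golds).
    """
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(len(preds)), len(golds)):
        cost = sum(match_cost(preds[p], golds[g]) for g, p in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return list(best_perm), best_cost


# Three parallel prediction slots, emitted in no particular order.
preds = [
    {"span": (3, 5), "type_probs": {"Shoes": 0.8, "Person": 0.1}},
    {"span": (2, 2), "type_probs": {"Shoes": 0.2, "Person": 0.2}},
    {"span": (0, 1), "type_probs": {"Person": 0.9, "Shoes": 0.05}},
]
golds = [
    {"span": (0, 1), "type": "Person"},
    {"span": (3, 5), "type": "Shoes"},
]

assignment, total_cost = best_assignment(preds, golds)
print(assignment, round(total_cost, 3))
```

Because the assignment is computed over the whole set at once, the order in which slots emit entities does not matter, which is the property that removes the dependence on a fixed decoding order. For realistic set sizes, the exhaustive search would normally be replaced by a Hungarian solver such as `scipy.optimize.linear_sum_assignment`.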
### Main Contributions
1. **In-depth Exploration of the Weaknesses of Existing Unified GMNER Methods**
- The paper analyzes the deficiencies of existing methods at two levels, intra-entity and inter-entity relationships, and proposes the MQSPN framework to adaptively learn intra-entity relationships and establish inter-entity relationships from a globally optimal matching perspective.
2. **First Application of the Set Prediction Paradigm to the GMNER Task**
- The paper proposes MSP, the first attempt to apply set prediction to the GMNER task.
3. **Experimental Verification**
- Extensive experiments on two Twitter benchmark datasets show that the proposed method significantly outperforms existing state-of-the-art methods. Ablation studies also verify the effectiveness of each designed module.
Through these innovations, MQSPN performs strongly on complex multimodal entity recognition tasks, especially fine-grained entity type recognition.