Abstract: Multimodal named entity recognition (MNER) aims to detect entity spans in image–text pairs and classify them into the corresponding entity types. Although existing works successfully use cross-modal attention mechanisms to integrate textual and visual representations, we observe three key issues. First, models are prone to being misled when fusing unrelated text and images. Second, most existing visual features are neither enhanced nor filtered. Third, because text and images are encoded independently, a noticeable semantic gap exists between them. To address these challenges, we propose a framework called visual clue guidance and consistency matching (GMF). To tackle the first issue, we introduce a visual clue guidance (VCG) module that hierarchically extracts visual information at multiple scales. This information serves as an injectable visual clue guidance sequence that steers text representations toward error-insensitive prediction decisions. Furthermore, we incorporate a cross-scale attention (CSA) module that mitigates interference across scales and improves the capture of image details. To address the third issue, the semantic disparity between text and images, we employ a consistency matching (CM) module based on multimodal contrastive learning, which facilitates collaborative learning across modalities. We validate the framework through extensive comparative experiments, ablation studies, and case studies on two widely used benchmark datasets, and the results demonstrate its effectiveness.
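
The abstract describes the consistency matching (CM) module only as being "based on the idea of multimodal contrastive learning" and gives no formula. As a rough illustration of that general idea, and not the authors' actual loss, the sketch below computes a symmetric InfoNCE-style objective over a batch of paired text and image embeddings. The function names, the use of cosine similarity, and the temperature value of 0.07 are all assumptions made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine_matrix(text_vecs, image_vecs):
    """Cosine similarity between every text vector and every image vector."""
    texts = [l2_normalize(v) for v in text_vecs]
    images = [l2_normalize(v) for v in image_vecs]
    return [[sum(a * b for a, b in zip(t, i)) for i in images] for t in texts]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class of row k is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[k]
    return total / len(logits)


def symmetric_contrastive_loss(text_vecs, image_vecs, temperature=0.07):
    """Hypothetical helper: average of text-to-image and image-to-text losses.

    Assumes text_vecs[k] and image_vecs[k] come from the same image-text pair,
    so matched pairs lie on the diagonal of the similarity matrix.
    """
    sims = cosine_matrix(text_vecs, image_vecs)
    logits = [[s / temperature for s in row] for row in sims]
    logits_transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (
        cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_transposed)
    )


# Toy embeddings: aligned pairs should give a lower loss than swapped pairs.
text = [[1.0, 0.0], [0.0, 1.0]]
image_matched = [[0.9, 0.1], [0.1, 0.9]]
image_swapped = [[0.1, 0.9], [0.9, 0.1]]
print(symmetric_contrastive_loss(text, image_matched))
print(symmetric_contrastive_loss(text, image_swapped))
```

Minimizing a loss of this shape pulls the embeddings of a matched image–text pair together and pushes mismatched pairs in the batch apart, which is one common way to narrow a semantic gap between independently encoded modalities.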

Learning Implicit Entity-object Relations by Bidirectional Generative Alignment for Multimodal NER

Visual Clue Guidance and Consistency Matching Framework for Multimodal Named Entity Recognition

Joint Multimodal Entity-Relation Extraction Based on Edge-enhanced Graph Alignment Network and Word-pair Relation Tagging

End-to-End Visual Grounding Framework for Multimodal NER in Social Media Posts

Object-Aware Multimodal Named Entity Recognition in Social Media Posts With Adversarial Learning

Fine-Grained Multimodal Named Entity Recognition and Grounding with a Generative Framework

Learning from Different Text-Image Pairs: A Relation-enhanced Graph Convolutional Network for Multimodal NER

A Multi-Task Framework Based on Decomposition for Multimodal Named Entity Recognition

Rethinking Uncertainly Missing and Ambiguous Visual Modality in Multi-Modal Entity Alignment

Contrastive Pre-training with Multi-level Alignment for Grounded Multimodal Named Entity Recognition

MCG-MNER: A Multi-Granularity Cross-Modality Generative Framework for Multimodal NER with Instruction

MAF - A General Matching and Alignment Framework for Multimodal Named Entity Recognition

MGICL: Multi-Grained Interaction Contrastive Learning for Multimodal Named Entity Recognition

Multimodal Named Entity Recognition with Bottleneck Fusion and Contrastive Learning

A Span-based Multimodal Variational Autoencoder for Semi-supervised Multimodal Named Entity Recognition

Hierarchical Aligned Multimodal Learning for NER on Tweet Posts

Chinese Multimodal Named Entity Recognition in Conversational Scenarios

Generative Multimodal Data Augmentation for Low-Resource Multimodal Named Entity Recognition

LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition

Dynamic Graph Construction Framework for Multimodal Named Entity Recognition in Social Media