Abstract: The goal of multimodal named entity recognition (MNER) is to detect entity spans in given image–text pairs and classify them into the corresponding entity types. Although existing works have successfully used cross-modal attention mechanisms to integrate textual and visual representations, we observe three key issues. First, models are prone to being misled when fusing unrelated text and images. Second, most existing visual features are neither enhanced nor filtered. Third, because text and images are encoded independently, a noticeable semantic gap exists between them. To address these challenges, we propose a visual clue guidance and consistency matching framework (GMF). To tackle the first issue, we introduce a visual clue guidance (VCG) module that hierarchically extracts visual information at multiple scales. This information is used as an injectable visual clue guidance sequence that steers text representations toward error-insensitive prediction decisions. To tackle the second issue, we incorporate a cross-scale attention (CSA) module that mitigates interference across scales and enhances the ability of the image features to capture details. To address the third issue, the semantic disparity between text and images, we employ a consistency matching (CM) module based on multimodal contrastive learning, which facilitates collaborative learning over multimodal data. Comprehensive experiments on two widely used benchmark datasets, including comparative experiments, ablation studies, and case studies, demonstrate the effectiveness of the proposed framework.
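The idea of contrastive matching between text and image representations can be illustrated with a generic symmetric InfoNCE-style loss. This is only a minimal sketch under stated assumptions, not the CM module of the paper: the function name `contrastive_matching_loss`, the temperature value, and the use of one pooled embedding per sentence and per image are all hypothetical choices made for illustration.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def contrastive_matching_loss(text_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired embeddings.

    Assumption: row i of text_emb and row i of image_emb come from the
    same image-text pair; all other rows in the batch act as negatives.
    """
    # L2-normalise so the dot product is a cosine similarity.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = t @ v.T / temperature  # shape: (batch, batch)
    idx = np.arange(len(t))
    # Text-to-image: each text should rank its own image highest.
    loss_t2i = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Image-to-text: each image should rank its own text highest.
    loss_i2t = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_t2i + loss_i2t)

# Toy data: random vectors stand in for encoder outputs.
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))
aligned = contrastive_matching_loss(text, text + 0.01 * rng.normal(size=(4, 8)))
shuffled = contrastive_matching_loss(text, rng.normal(size=(4, 8)))
```

In this toy setup, nearly identical text and image embeddings yield a lower loss than unrelated ones, which is the behaviour a contrastive objective rewards when it pulls matched pairs together and pushes mismatched pairs apart.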

Visual Clue Guidance and Consistency Matching Framework for Multimodal Named Entity Recognition
