Abstract: Multimodal entity linking (MEL) aims to link ambiguous mentions within multimodal contexts to corresponding entities in a multimodal knowledge base. Most existing approaches to MEL are based on representation learning or vision-and-language pre-training mechanisms for exploring the complementary effect among multiple modalities. However, these methods suffer from two limitations. On the one hand, they overlook the possibility of considering negative samples from the same modality. On the other hand, they lack mechanisms to capture bidirectional cross-modal interaction. To address these issues, we propose a Multi-level Matching network for Multimodal Entity Linking (M3EL). Specifically, M3EL is composed of three different modules: (i) a Multimodal Feature Extraction module, which extracts modality-specific representations with a multimodal encoder and introduces an intra-modal contrastive learning sub-module to obtain better discriminative embeddings based on uni-modal differences; (ii) an Intra-modal Matching Network module, which contains two levels of matching granularity: Coarse-grained Global-to-Global and Fine-grained Global-to-Local, to achieve local and global level intra-modal interaction; (iii) a Cross-modal Matching Network module, which applies bidirectional strategies, Textual-to-Visual and Visual-to-Textual matching, to implement bidirectional cross-modal interaction. Extensive experiments conducted on WikiMEL, RichpediaMEL, and WikiDiverse datasets demonstrate the outstanding performance of M3EL when compared to the state-of-the-art baselines.

What problem does this paper attempt to address?

This paper attempts to solve two main problems in Multimodal Entity Linking (MEL):

1. **Ignoring negative samples within the same modality**: Existing methods only consider cross-modal negative samples (e.g., mismatched samples between text and image) when conducting contrastive learning, while ignoring negative samples within the same modality (e.g., differences between different text descriptions or different images). As a result, the model cannot fully capture the semantic differences within a modality, which degrades the quality of the embedding representation.
2. **Lack of a two-way cross-modal interaction mechanism**: Existing methods usually only consider one-way information flow (e.g., from text to image or from image to text) when processing cross-modal information, without achieving two-way cross-modal interaction. This one-way mechanism limits the model's use of multimodal information and makes it difficult to capture the complex relationships between modalities.

To solve the above problems, the authors propose a Multi-level Matching Network for Multimodal Entity Linking (M3EL), which includes the following three modules:

- **Multimodal Feature Extraction module**: Uses the pre-trained CLIP model to extract modality-specific representations of text and image, and introduces an intra-modal contrastive learning sub-module to obtain more discriminative embedding representations.
- **Intra-modal Matching Network module**: Contains two levels of matching granularity, Coarse-grained Global-to-Global matching and Fine-grained Global-to-Local matching, to achieve local and global feature interactions within the modality.
- **Cross-modal Matching Network module**: Applies a two-way strategy (Textual-to-Visual and Visual-to-Textual matching) to achieve two-way cross-modal interaction and reduce the gap between different modal distributions.

Through these improvements, the M3EL model can achieve better performance in the multimodal entity linking task, especially when dealing with complex multimodal data. Experimental results show that M3EL significantly outperforms existing state-of-the-art methods on multiple benchmark datasets.

### Formula Summary

1. **Intra-modal contrastive learning loss calculation**:

\[
L(T_i^e, T_i^m) = -\log \frac{\theta(T_i^e, T_i^m)}{\theta(T_i^e, T_i^m) + \beta \cdot \Phi_{inner} + \gamma \cdot \Phi_{inter}}
\]

where:

\[
\Phi_{inner} = \sum_{T_j^e \in N_e} \theta(T_i^e, T_j^e), \quad \Phi_{inter} = \sum_{T_j^m \in N_m} \theta(T_i^e, T_j^m)
\]

\[
\theta(x, y) = e^{\delta(x, y)/\tau}, \quad \delta(x, y) \text{ is cosine similarity}
\]

2. **Final contrastive learning loss**:

\[
L_{cl} = \text{avg}\left( \sum_i \left[ L(T_i^e, T_i^m) + L(T_i^m, T_i^e) + L(V_i^e, V_i^m) + L(V_i^m, V_i^e) \right] \right)
\]
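The two formulas in the summary can be sketched in plain Python to make the roles of the negative sets concrete. This is a minimal illustration, not the authors' implementation, and it rests on several assumptions that the summary does not spell out: the negative sets \(N_e\) and \(N_m\) are taken to be all other entity and mention embeddings in the same batch, the reversed direction \(L(T_i^m, T_i^e)\) simply swaps which side counts as "inner", `avg` is read as the mean over the batch, and the default values of `beta`, `gamma`, and `tau` are placeholders. The function names (`pair_loss`, `directional_losses`, `contrastive_loss`) are hypothetical.

```python
import math

def cosine(x, y):
    # delta(x, y): cosine similarity; assumes both vectors are non-zero.
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def theta(x, y, tau):
    # theta(x, y) = exp(delta(x, y) / tau)
    return math.exp(cosine(x, y) / tau)

def pair_loss(anchor, positive, inner_negs, inter_negs,
              beta=1.0, gamma=1.0, tau=0.1):
    # L(anchor, positive) with separately weighted negative sums:
    # Phi_inner over negatives from the anchor's own side,
    # Phi_inter over negatives from the positive's side.
    pos = theta(anchor, positive, tau)
    phi_inner = sum(theta(anchor, n, tau) for n in inner_negs)
    phi_inter = sum(theta(anchor, n, tau) for n in inter_negs)
    return -math.log(pos / (pos + beta * phi_inner + gamma * phi_inter))

def directional_losses(anchors, positives, **kw):
    # Assumption: in-batch negatives, i.e. every index j != i.
    n = len(anchors)
    losses = []
    for i in range(n):
        inner = [anchors[j] for j in range(n) if j != i]
        inter = [positives[j] for j in range(n) if j != i]
        losses.append(pair_loss(anchors[i], positives[i], inner, inter, **kw))
    return losses

def contrastive_loss(T_e, T_m, V_e, V_m, **kw):
    # L_cl: sum the four directional terms per sample, then average
    # over the batch (one reading of "avg" in the summary).
    n = len(T_e)
    per_sample = [0.0] * n
    for anchors, positives in ((T_e, T_m), (T_m, T_e), (V_e, V_m), (V_m, V_e)):
        for i, loss in enumerate(directional_losses(anchors, positives, **kw)):
            per_sample[i] += loss
    return sum(per_sample) / n
```

Calling `contrastive_loss` with four equally sized lists of embedding vectors (entity text, mention text, entity image, mention image) returns a single scalar. Setting `beta=0` removes the same-side negatives from the denominator, leaving only the `Phi_inter` negatives, which is one way to see what the \(\Phi_{inner}\) term adds.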

Multi-level Matching Network for Multimodal Entity Linking
