Abstract: Multimodal Entity Linking (MEL) is a crucial task that aims to link ambiguous mentions within multimodal contexts to their referent entities in a multimodal knowledge base, such as Wikipedia. Existing methods rely heavily on complex mechanisms and extensive model tuning to model multimodal interaction on specific datasets. However, these methods overcomplicate the MEL task and overlook visual semantic information, which makes them costly and hard to scale. Moreover, they cannot resolve issues such as textual ambiguity, redundancy, and noisy images, which severely degrade their performance. Fortunately, the advent of Large Language Models (LLMs) with robust capabilities in text understanding and reasoning, particularly Multimodal Large Language Models (MLLMs) that can process multimodal inputs, provides new insights into addressing this challenge. However, how to design a universally applicable LLM-based MEL approach remains an open problem. To this end, we propose UniMEL, a unified framework that establishes a new paradigm for multimodal entity linking with LLMs. In this framework, we employ LLMs to augment the representations of mentions and entities individually, by integrating textual and visual information and by refining the textual information. Subsequently, we employ an embedding-based method to retrieve and re-rank candidate entities. Then, with only ~0.26% of the model parameters fine-tuned, LLMs make the final selection from the candidate entities. Extensive experiments on three public benchmark datasets demonstrate that our solution achieves state-of-the-art performance, and ablation studies verify the effectiveness of all modules. Our code is available at https://github.com/Javkonline/UniMEL.
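To make the embedding-based retrieval step concrete, the following is a minimal sketch of ranking candidate entities by cosine similarity to a mention embedding. It is not the UniMEL implementation: the function names (`cosine`, `retrieve_top_k`), the three-dimensional vectors, and the entity names are all hypothetical and chosen only for illustration. In the framework described above, the embeddings would come from LLM-augmented mention and entity descriptions, and an LLM would make the final selection from the returned candidates.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_top_k(mention_vec, entity_vecs, k):
    """Return the k (entity_name, score) pairs most similar to the mention, best first."""
    scored = [(name, cosine(mention_vec, vec)) for name, vec in entity_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy, hand-written embeddings (assumption: real ones would come from an embedding model).
entity_vecs = {
    "Apple Inc.": [0.9, 0.1, 0.0],
    "Apple (fruit)": [0.1, 0.9, 0.0],
    "Apple Records": [0.6, 0.0, 0.8],
}
mention_vec = [0.8, 0.2, 0.1]

# Shortlist of candidates that a later selection step would choose from.
candidates = retrieve_top_k(mention_vec, entity_vecs, k=2)
for name, score in candidates:
    print(f"{name}: {score:.3f}")
```

With these toy vectors the shortlist contains "Apple Inc." followed by "Apple Records". A shortlist of this kind keeps the final selection step small, since the LLM only has to choose among a handful of candidates rather than the whole knowledge base.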

UniMEL: A Unified Framework for Multimodal Entity Linking with Large Language Models
