Abstract: Multimodal Entity Linking (MEL) is a crucial task that aims to link ambiguous mentions within multimodal contexts to their referent entities in a multimodal knowledge base, such as Wikipedia. Existing methods rely heavily on complex mechanisms and extensive model tuning to model multimodal interaction on specific datasets. However, these methods overcomplicate the MEL task and overlook visual semantic information, which makes them costly and hard to scale. Moreover, they cannot resolve issues such as textual ambiguity, redundancy, and noisy images, which severely degrade their performance. Fortunately, the advent of Large Language Models (LLMs) with robust capabilities in text understanding and reasoning, particularly Multimodal Large Language Models (MLLMs) that can process multimodal inputs, provides new insights into addressing this challenge. However, how to design a universally applicable LLM-based MEL approach remains an open problem. To this end, we propose UniMEL, a unified framework that establishes a new paradigm for processing multimodal entity linking tasks using LLMs. In this framework, we employ LLMs to augment the representations of mentions and entities individually by integrating textual and visual information and refining the textual information. Subsequently, we employ an embedding-based method for retrieving and re-ranking candidate entities. Then, with only ~0.26% of the model parameters fine-tuned, LLMs can make the final selection from the candidate entities. Extensive experiments on three public benchmark datasets demonstrate that our solution achieves state-of-the-art performance, and ablation studies verify the effectiveness of all modules. Our code is available at https://github.com/Javkonline/UniMEL.
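The embedding-based retrieval step mentioned in the abstract can be illustrated with a minimal sketch. This is not the UniMEL implementation: the function names `cosine` and `retrieve_top_k`, the toy three-dimensional vectors, and the use of cosine similarity as the scoring function are all assumptions made for illustration. In practice the vectors would come from an embedding model applied to the augmented mention and entity descriptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_top_k(mention_vec, entity_vecs, k=2):
    """Return the k (entity_name, score) pairs most similar to the mention.

    entity_vecs is a hypothetical dict mapping entity names to embeddings.
    """
    scored = [(name, cosine(mention_vec, vec)) for name, vec in entity_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings, invented for this example only.
mention = [1.0, 0.0, 1.0]
entities = {
    "Apple Inc.": [1.0, 0.1, 0.9],
    "Apple (fruit)": [0.0, 1.0, 0.2],
    "Apple Records": [0.5, 0.5, 0.5],
}

for name, score in retrieve_top_k(mention, entities, k=2):
    print(f"{name}: {score:.3f}")
```

Under this sketch, the shortlisted candidates would then be passed to the LLM for the final selection described in the abstract.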

UniMEL: A Unified Framework for Multimodal Entity Linking with Large Language Models
