Entity Matching using Large Language Models

Ralph Peeters, Christian Bizer

2024-06-05

Abstract: Entity matching is the task of deciding whether two entity descriptions refer to the same real-world entity, and it is a central step in most data integration pipelines. Many state-of-the-art entity matching methods rely on pre-trained language models (PLMs) such as BERT or RoBERTa. Two major drawbacks of these models for entity matching are that (i) the models require significant amounts of task-specific training data and (ii) the fine-tuned models are not robust to out-of-distribution entities. This paper investigates generative large language models (LLMs) as an alternative to PLM-based matchers that depends less on task-specific training data and is more robust. Our study covers hosted LLMs as well as open-source LLMs that can be run locally. We evaluate these models in a zero-shot scenario and in a scenario where task-specific training data is available. We compare different prompt designs and the prompt sensitivity of the models, and show that there is no single best prompt; instead, the prompt needs to be tuned for each model/dataset combination. We further investigate (i) the selection of in-context demonstrations, (ii) the generation of matching rules, and (iii) fine-tuning a hosted LLM using the same pool of training data. Our experiments show that the best LLMs require no or only a few training examples to perform similarly to PLMs that were fine-tuned using thousands of examples. LLM-based matchers further exhibit higher robustness to unseen entities. We show that GPT-4 can generate structured explanations for matching decisions. The model can automatically identify potential causes of matching errors by analyzing explanations of wrong decisions. We demonstrate that the model can generate meaningful textual descriptions of the identified error classes, which can help data engineers improve entity matching pipelines.

Computation and Language, Machine Learning

What problem does this paper attempt to address?

This paper addresses two main issues in entity matching: 1) existing methods based on pre-trained language models (such as BERT or RoBERTa) require a large amount of task-specific training data for fine-tuning; 2) the fine-tuned models are not robust to unseen (out-of-distribution) entities. To tackle these problems, the paper explores generative large language models (LLMs) as an alternative. Specifically, the study evaluates how different prompt designs affect LLM entity matching performance, both in a zero-shot scenario and in a scenario where task-specific training data is available, and compares various LLMs against PLM-based matchers. The experiments show that the best LLMs need no or only a few training examples to perform similarly to PLMs fine-tuned on thousands of examples, and that LLM-based matchers are more robust to unseen entities. The paper also finds that no single prompt is best for every model/dataset combination, and it investigates how in-context demonstrations, generated matching rules, and fine-tuning can further improve the matching capabilities of LLMs.
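
To make the prompting setup described above more concrete, here is a minimal Python sketch of how a pairwise matching prompt could be assembled, optionally extended with in-context demonstrations, and how a free-text answer could be mapped to a match decision. Everything in it is an illustrative assumption rather than the paper's method: the prompt wording, the token-level Jaccard heuristic for picking demonstrations, and the names `select_demos`, `build_prompt`, `parse_answer`, and `match` are hypothetical, and `llm` stands for any user-supplied function that sends a prompt to a model and returns its text.

```python
from typing import Callable, List, Sequence, Tuple

# (description_a, description_b, is_match); a hypothetical demonstration format.
Demo = Tuple[str, str, bool]


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def select_demos(a: str, b: str, pool: Sequence[Demo], k: int) -> List[Demo]:
    """Pick the k labeled pairs most similar to the query pair.

    Similarity-based selection is an assumed heuristic for illustration.
    """
    query = f"{a} {b}"
    ranked = sorted(pool, key=lambda d: jaccard(query, f"{d[0]} {d[1]}"), reverse=True)
    return ranked[:k]


def build_prompt(a: str, b: str, demos: Sequence[Demo] = ()) -> str:
    """Build a zero-shot prompt, or a few-shot prompt if demos are given."""
    lines = [
        "Do the two entity descriptions refer to the same real-world entity? "
        "Answer with 'Yes' or 'No'."
    ]
    for demo_a, demo_b, label in demos:
        lines += [
            f"Entity 1: {demo_a}",
            f"Entity 2: {demo_b}",
            f"Answer: {'Yes' if label else 'No'}",
        ]
    lines += [f"Entity 1: {a}", f"Entity 2: {b}", "Answer:"]
    return "\n".join(lines)


def parse_answer(text: str) -> bool:
    """Treat any reply that starts with 'yes' (case-insensitive) as a match."""
    return text.strip().lower().startswith("yes")


def match(a: str, b: str, llm: Callable[[str], str],
          pool: Sequence[Demo] = (), k: int = 0) -> bool:
    """Run one matching decision; `llm` is a caller-supplied prompt -> text function."""
    demos = select_demos(a, b, pool, k) if k > 0 else []
    return parse_answer(llm(build_prompt(a, b, demos)))
```

With `k=0` this reduces to the zero-shot case; passing a small labeled pool and `k>0` adds demonstrations ahead of the query pair. Since the paper reports that no single prompt works best across models and datasets, the instruction text in `build_prompt` should be read as a placeholder to be tuned, not as a recommended wording.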

Fine-tuning Large Language Models for Entity Matching

Leveraging Large Language Models for Entity Matching

Disambiguate Entity Matching using Large Language Models through Relation Discovery

Learning from Natural Language Explanations for Generalizable Entity Matching

Match, Compare, or Select? An Investigation of Large Language Models for Entity Matching

Schema Matching with Large Language Models: an Experimental Study

BoostER: Leveraging Large Language Models for Enhancing Entity Resolution

Using ChatGPT for Entity Matching

AnyMatch -- Efficient Zero-Shot Entity Matching with a Small Language Model

MapperGPT: Large Language Models for Linking and Mapping Entities

EntGPT: Linking Generative Large Language Models with Knowledge Bases

LLM-Align: Utilizing Large Language Models for Entity Alignment in Knowledge Graphs

LLMAEL: Large Language Models are Good Context Augmenters for Entity Linking

GraLMatch: Matching Groups of Entities with Graphs and Language Models

LLMs4OM: Matching Ontologies with Large Language Models

Entity Tracking in Language Models

Advancing entity recognition in biomedicine via instruction tuning of large language models

On Leveraging Large Language Models for Enhancing Entity Resolution: A Cost-efficient Approach

Exploring Advanced Large Language Models with LLMsuite

Do Language Models Learn about Legal Entity Types during Pretraining?