Abstract: Entity resolution, the task of identifying and merging records that refer to the same real-world entity, is crucial in sectors like e-commerce, healthcare, and law enforcement. Large Language Models (LLMs) introduce an innovative approach to this task, capitalizing on their advanced linguistic capabilities and a "pay-as-you-go" model that provides significant advantages to those without extensive data science expertise. However, current LLMs are costly because usage is billed per API request. Existing methods often either lack quality or become prohibitively expensive at scale. To address these problems, we propose an uncertainty reduction framework that uses LLMs to improve entity resolution results. We first initialize the possible partitions of the records into entity clusters, where the records in a cluster refer to the same entity, and define the uncertainty of the result. Then, we reduce the uncertainty by selecting a few valuable matching questions for LLM verification. Upon receiving the answers, we update the probability distribution over the possible partitions. To further reduce costs, we design an efficient algorithm that judiciously selects the most valuable matching pairs to query. Additionally, we create error-tolerant techniques to handle LLM mistakes and a dynamic adjustment method to reach the correct partitions. Experimental results show that our method is efficient and effective, offering promising applications in real-world tasks.

What problem does this paper attempt to address?

This paper addresses the cost-effectiveness problem in entity resolution (ER). Specifically, it focuses on how to use large language models (LLMs) to improve the efficiency and effectiveness of entity resolution while controlling costs. Entity resolution refers to identifying and merging records that refer to the same real-world entity, a task that is crucial in fields such as e-commerce, healthcare, and law enforcement.

### Main Problems

1. **Cost Problem**:
   - Current LLMs are costly to use because they are billed per API request. For large-scale datasets, simply asking every possible matching question becomes very expensive.
   - Existing methods are either of low quality or too costly for large-scale applications.
2. **Balance between Accuracy and Cost**:
   - Unnecessary API requests must be avoided to cut costs, without sacrificing the accuracy of entity resolution.

### Solutions

To address these problems, the paper proposes a framework based on uncertainty reduction that uses LLMs to improve the results of entity resolution. The specific steps are as follows:

1. **Initialize Possible Partitions**:
   - Use traditional similarity tools to initialize the possible entity partitions and define the uncertainty of the result.
2. **Select Valuable Matching Questions**:
   - Reduce uncertainty by selecting a small number of valuable matching questions (MQs) for LLM verification.
   - Design an efficient algorithm that selects the most valuable questions to query from a large number of candidate matching pairs.
3. **Adjust Probability Distribution**:
   - Update the probability distribution over the possible partitions according to the LLM's answers, which further reduces uncertainty.
   - Introduce error-tolerant techniques to handle LLM mistakes, and design a dynamic adjustment method to reach the correct partition.

### Key Contributions

1. **Uncertainty Reduction Framework**:
   - Propose an uncertainty-reduction-based method for using LLMs in entity resolution.
   - Prove that the uncertainty reduction achieved by a set of matching questions is equivalent to the joint entropy of its set of possible answers.
2. **NP-Hardness of the Matching Question Selection Problem**:
   - Prove that the optimization problem of selecting matching questions is NP-hard, and propose a greedy algorithm that efficiently selects matching questions under a budget constraint.
3. **Error-Tolerant Design**:
   - Design error-tolerant techniques that allow uncertainty to be reduced even when the LLM gives imperfect answers.
   - Use the dynamic adjustment method to reach an accurate entity partition even if the initial estimate is off.

### Experimental Results

The paper verifies the effectiveness and efficiency of the proposed method through experiments. The results show that the method can improve the accuracy and efficiency of entity resolution while reducing costs, which indicates its potential for real-world tasks.
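The select-and-update loop summarized above can be illustrated with a small sketch. This is not the paper's implementation: it assumes the possible partitions can be listed explicitly (realistic only for tiny inputs), measures uncertainty as Shannon entropy, scores a set of matching questions by the joint entropy of its answers, and models LLM mistakes with a single hypothetical `accuracy` parameter. All function names and the toy data are illustrative.

```python
import math
from itertools import combinations


def same_cluster(partition, a, b):
    """True if records a and b share a cluster in this partition."""
    return any(a in cluster and b in cluster for cluster in partition)


def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def joint_answer_entropy(partitions, probs, questions):
    """Joint entropy of the yes/no answers to `questions`.

    Each possible partition implies one answer vector; partitions that
    imply the same vector pool their probability mass.
    """
    mass = {}
    for partition, p in zip(partitions, probs):
        key = tuple(same_cluster(partition, a, b) for a, b in questions)
        mass[key] = mass.get(key, 0.0) + p
    return entropy(mass.values())


def greedy_select(partitions, probs, candidates, budget):
    """Greedily pick up to `budget` questions by marginal entropy gain."""
    chosen = []
    for _ in range(budget):
        base = joint_answer_entropy(partitions, probs, chosen)
        best, best_gain = None, 1e-12  # ignore questions with no gain
        for q in candidates:
            if q in chosen:
                continue
            gain = joint_answer_entropy(partitions, probs, chosen + [q]) - base
            if gain > best_gain:
                best, best_gain = q, gain
        if best is None:
            break
        chosen.append(best)
    return chosen


def update(partitions, probs, question, answer, accuracy=0.9):
    """Bayesian update under an assumed LLM accuracy (hypothetical value).

    Partitions that agree with the answer are weighted by `accuracy`,
    the rest by `1 - accuracy`, so a wrong answer never zeroes out
    the true partition.
    """
    a, b = question
    weights = [
        p * (accuracy if same_cluster(part, a, b) == answer else 1 - accuracy)
        for part, p in zip(partitions, probs)
    ]
    total = sum(weights)
    return [w / total for w in weights]


# Toy example: three records and three possible partitions.
partitions = [
    [{0, 1}, {2}],    # records 0 and 1 are the same entity
    [{0}, {1, 2}],    # records 1 and 2 are the same entity
    [{0}, {1}, {2}],  # all records are distinct entities
]
probs = [0.5, 0.3, 0.2]
candidates = list(combinations(range(3), 2))

selected = greedy_select(partitions, probs, candidates, budget=2)
posterior = update(partitions, probs, selected[0], answer=True)
print("selected questions:", selected)
print("entropy before:", entropy(probs), "after:", entropy(posterior))
```

In this toy setup the pair (0, 2) is never selected, because every possible partition gives the same answer to it and asking would spend budget without reducing uncertainty.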

On Leveraging Large Language Models for Enhancing Entity Resolution: A Cost-efficient Approach
