On Leveraging Large Language Models for Enhancing Entity Resolution: A Cost-efficient Approach

Huahang Li, Longyu Feng, Shuangyin Li, Fei Hao, Chen Jason Zhang, Yuanfeng Song
2024-09-12
Abstract: Entity resolution, the task of identifying and merging records that refer to the same real-world entity, is crucial in sectors like e-commerce, healthcare, and law enforcement. Large Language Models (LLMs) introduce an innovative approach to this task, capitalizing on their advanced linguistic capabilities and a "pay-as-you-go" model that provides significant advantages to those without extensive data science expertise. However, current LLMs are costly because they are billed per API request. Existing methods often either lack quality or become prohibitively expensive at scale. To address these problems, we propose an uncertainty reduction framework that uses LLMs to improve entity resolution results. We first initialize the possible partitions of the records into entity clusters, where each cluster groups records that refer to the same entity, and define the uncertainty of the result. Then, we reduce the uncertainty by selecting a few valuable matching questions for LLM verification. Upon receiving the answers, we update the probability distribution over the possible partitions. To further reduce costs, we design an efficient algorithm to judiciously select the most valuable matching pairs to query. Additionally, we create error-tolerant techniques to handle LLM mistakes and a dynamic adjustment method to reach truly correct partitions. Experimental results show that our method is efficient and effective, offering promising applications in real-world tasks.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the cost-effectiveness of entity resolution (ER). Specifically, it focuses on how to use large language models (LLMs) to improve the efficiency and effectiveness of entity resolution while keeping costs under control. Entity resolution means identifying and merging records that refer to the same real-world entity, a task that is crucial in fields such as e-commerce, healthcare, and law enforcement.

### Main Problems

1. **Cost Problem**:
   - Current LLMs are costly to use because they are billed per API request. For large-scale data sets, asking every possible matching question becomes very expensive.
   - Existing methods are either of low quality or too costly for large-scale applications.
2. **Balance between Accuracy and Cost**:
   - Unnecessary API requests must be avoided to cut costs while preserving the accuracy of entity resolution.

### Solutions

To address these problems, the paper proposes a framework based on uncertainty reduction that uses LLMs to improve entity resolution results. The steps are as follows:

1. **Initialize Possible Partitions**:
   - Use traditional similarity tools to initialize the possible entity partitions and define the uncertainty of the result.
2. **Select Valuable Matching Questions**:
   - Reduce uncertainty by selecting a small number of valuable matching questions (MQs) for LLM verification.
   - Design an efficient algorithm that selects the most valuable questions from a large number of candidate matching pairs.
3. **Adjust Probability Distribution**:
   - Update the probability distribution over the possible partitions according to the LLM's answers, further reducing uncertainty.
   - Introduce fault-tolerance techniques to handle LLM errors and design a dynamic adjustment method to reach the correct partition.

### Key Contributions

1. **Uncertainty Reduction Framework**:
   - Propose a method based on uncertainty reduction for using LLMs in entity resolution.
   - Prove that the uncertainty reduction achieved by a set of matching questions is equivalent to the joint entropy of the answers to those questions.
2. **NP-Hardness of the Matching Question Selection Problem**:
   - Prove that the optimization problem of selecting matching questions is NP-hard, and propose a greedy algorithm that efficiently selects matching questions under a budget constraint.
3. **Fault-Tolerance Design**:
   - Design fault-tolerance techniques that let the framework reduce uncertainty even when the LLM gives imperfect answers.
   - Through the dynamic adjustment method, reach an accurate entity partition even if the initial estimate is inaccurate.

### Experimental Results

The paper evaluates the proposed method experimentally. According to the summary, the results show that it improves the accuracy and efficiency of entity resolution while reducing cost, which suggests it is applicable to real-world tasks.
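
The select-ask-update loop summarized above can be illustrated with a small toy sketch. This is not the paper's implementation: the three-record data set, the prior probabilities, the helper names (`partition`, `same_entity`, `joint_answer_entropy`, `greedy_select`, `update`), and the fixed `accuracy` parameter that models LLM mistakes are all assumptions made for illustration. The sketch scores a set of matching questions by the joint entropy of their answers, picks questions greedily under a budget, and applies a Bayesian update that tolerates wrong answers.

```python
import math
from itertools import combinations


def partition(*clusters):
    """Build a hashable partition; each cluster is a string of one-letter record ids."""
    return frozenset(frozenset(c) for c in clusters)


# Hypothetical prior over the possible partitions of records a, b, c.
PRIOR = {
    partition("abc"): 0.35,
    partition("ab", "c"): 0.35,
    partition("a", "bc"): 0.20,
    partition("a", "b", "c"): 0.10,
}


def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def same_entity(part, pair):
    """True if both records of `pair` sit in the same cluster of `part`."""
    a, b = pair
    return any(a in cluster and b in cluster for cluster in part)


def joint_answer_entropy(dist, questions):
    """Entropy of the joint distribution over error-free answers to `questions`."""
    answers = {}
    for part, p in dist.items():
        key = tuple(same_entity(part, q) for q in questions)
        answers[key] = answers.get(key, 0.0) + p
    return entropy(answers.values())


def greedy_select(dist, candidates, budget):
    """Repeatedly add the question that maximizes the joint answer entropy."""
    chosen, remaining = [], list(candidates)
    while remaining and len(chosen) < budget:
        best = max(remaining, key=lambda q: joint_answer_entropy(dist, chosen + [q]))
        chosen.append(best)
        remaining.remove(best)
    return chosen


def update(dist, pair, llm_says_match, accuracy=0.9):
    """Bayesian update, assuming the LLM is right with probability `accuracy`."""
    posterior = {}
    for part, p in dist.items():
        agrees = same_entity(part, pair) == llm_says_match
        posterior[part] = p * (accuracy if agrees else 1.0 - accuracy)
    total = sum(posterior.values())
    return {part: p / total for part, p in posterior.items()}


candidates = list(combinations("abc", 2))  # every record pair is a candidate question
chosen = greedy_select(PRIOR, candidates, budget=2)
posterior = update(PRIOR, chosen[0], llm_says_match=True)
print("questions to ask:", chosen)
print("entropy before:", entropy(PRIOR.values()))
print("entropy after one answer:", entropy(posterior.values()))
```

On this toy prior the greedy pass picks the pair (b, c) first, because its answer is closest to a coin flip, and then (a, b), which together with the first question distinguishes all four partitions. Because `accuracy` is below 1, a single answer shifts probability mass toward the agreeing partitions without ruling the others out, which is the intuition behind tolerating LLM errors.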