Liberal Entity Matching as a Compound AI Toolchain

Silvery D. Fu, David Wang, Wen Zhang, Kathleen Ge
2024-06-17
Abstract: Entity matching (EM), the task of identifying whether two descriptions refer to the same entity, is essential in data management. Traditional methods have evolved from rule-based to AI-driven approaches, yet current techniques using large language models (LLMs) often fall short due to their reliance on static knowledge and rigid, predefined prompts. In this paper, we introduce Libem, a compound AI system designed to address these limitations by incorporating a flexible, tool-oriented approach. Libem supports entity matching through dynamic tool use, self-refinement, and optimization, allowing it to adapt and refine its process based on the dataset and performance metrics. Unlike traditional solo-AI EM systems, which often suffer from a lack of modularity that hinders iterative design improvements and system optimization, Libem offers a composable and reusable toolchain. This approach aims to contribute to ongoing discussions and developments in AI-driven data management.
Databases, Artificial Intelligence, Software Engineering
What problem does this paper attempt to address?
This paper addresses several key challenges in entity matching (EM) that are particularly prominent in current approaches built on a single large language model (LLM), namely solo-AI EM:

1. **Dependence on Manual Parameter Tuning**: Current single-model approaches usually rely on manual parameter tuning, that is, finding the best prompt for each new dataset through trial and error. For example, an LLM may not be aware that color is a key feature for distinguishing products in a specific entity-matching scenario, so a user has to state this manually or hand-pick a small number of examples to adapt the model to the dataset.
2. **Limitations of Static Knowledge**: The single-model approach depends on the static knowledge in its training data and may fail to identify entities that appeared after training. For example, when a product has just been released, an LLM may not recognize it correctly during matching.
3. **Rigidity of the Data-Processing Pipeline**: Entity matching is usually one stage of a larger data-processing pipeline that involves other forms of data processing. Existing single-model approaches apply fixed pre-processing rules and do not support adjusting the input data format to the model and task at hand. For example, existing systems usually strip the pattern information from entity data, even though this information is useful for improving matching accuracy.

To solve these problems, the paper proposes a compound AI toolchain named Libem, which aims to make entity matching more flexible, efficient, and user-friendly through three core mechanisms:

- **Tool Usage**: Provide relevant tools, such as data pre-processing and information retrieval, and let the model decide when and how to use them to complete the entity-matching task.
- **Self-Improvement**: When training data is available, the toolchain should adapt to the input dataset automatically and optimize performance without manual parameter tuning. Specifically, the toolchain can start from simple, general prompts and gradually evolve more specific and efficient prompts and parameters.
- **Optimized Configuration**: Users should be able to configure and optimize the toolchain easily to balance performance and cost. For example, the "chain-of-thought" prompt in the browsing step can be turned off to avoid long search-result response times.

Through these designs, Libem aims to overcome the limitations of existing single-model approaches and improve the accuracy and performance of entity matching.
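The three mechanisms can be made concrete with a small, self-contained sketch. It is not Libem's API: every name here (`ToolchainConfig`, `normalize`, `match`, `refine`) is hypothetical, and a simple token-overlap rule stands in for the LLM matcher so that the example runs without a model. It only illustrates the shape of the idea: a pre-processing tool that can be switched on or off, a matcher that consults learned rules, and a refinement step that learns from labeled mistakes which attributes (such as color) matter.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Entity = Dict[str, str]
LabeledPair = Tuple[Entity, Entity, bool]


@dataclass
class ToolchainConfig:
    # All fields are hypothetical knobs, not Libem's real configuration.
    use_normalize: bool = True          # toggle the pre-processing tool
    threshold: float = 0.5              # minimum name overlap for a match
    key_attributes: List[str] = field(default_factory=list)  # learned rules


def normalize(entity: Entity) -> Entity:
    """Pre-processing tool: lowercase keys and values, collapse whitespace."""
    return {k.strip().lower(): " ".join(v.lower().split())
            for k, v in entity.items()}


def match(a: Entity, b: Entity, cfg: ToolchainConfig) -> bool:
    """Stand-in for the LLM matcher: learned rules first, then name overlap."""
    if cfg.use_normalize:
        a, b = normalize(a), normalize(b)
    # A learned key attribute that differs rules the pair out immediately.
    for key in cfg.key_attributes:
        if key in a and key in b and a[key] != b[key]:
            return False
    tokens_a = set(a.get("name", "").split())
    tokens_b = set(b.get("name", "").split())
    if not tokens_a or not tokens_b:
        return False
    jaccard = len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
    return jaccard >= cfg.threshold


def refine(cfg: ToolchainConfig, labeled: List[LabeledPair]) -> ToolchainConfig:
    """Self-refinement stand-in: learn attributes that explain false positives."""
    learned = list(cfg.key_attributes)
    for a, b, is_match in labeled:
        if is_match or not match(a, b, cfg):
            continue  # only pairs wrongly predicted as matches are informative
        na, nb = normalize(a), normalize(b)
        for key in sorted(na.keys() & nb.keys()):
            if key != "name" and na[key] != nb[key] and key not in learned:
                learned.append(key)
    return ToolchainConfig(cfg.use_normalize, cfg.threshold, learned)


# Two listings that differ only in color, plus a true duplicate of the first.
red = {"name": "Acme Phone X  128GB", "color": "Red"}
blue = {"name": "acme phone x 128gb", "color": "Blue"}
red_again = {"name": "ACME Phone X 128GB", "color": "red"}

base = ToolchainConfig()
refined = refine(base, [(red, blue, False)])
```

With the general starting configuration, `match(red, blue, base)` wrongly reports a match because the names agree after normalization. After `refine` sees that labeled pair, `refined.key_attributes` contains `"color"`, so the red and blue listings are separated while `red` and `red_again` still match. In a real compound system the learned rule would be folded into the prompt rather than into a hard-coded check, and each toggle in the configuration would trade accuracy against latency and cost.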