Learning from Natural Language Explanations for Generalizable Entity Matching

Somin Wadhwa, Adit Krishnan, Runhui Wang, Byron C. Wallace, Chris Kong
2024-09-28
Abstract: Entity matching is the task of linking records from different sources that refer to the same real-world entity. Past work has primarily treated entity linking as a standard supervised learning problem. However, supervised entity matching models often do not generalize well to new data, and collecting exhaustive labeled training data is often cost prohibitive. Further, recent efforts have adopted LLMs for this task in few/zero-shot settings, exploiting their general knowledge. But LLMs are prohibitively expensive for performing inference at scale for real-world entity matching tasks. As an efficient alternative, we re-cast entity matching as a conditional generation task as opposed to binary classification. This enables us to "distill" LLM reasoning into smaller entity matching models via natural language explanations. This approach achieves strong performance, especially on out-of-domain generalization tests (10.85% F-1) where standalone generative methods struggle. We perform ablations that highlight the importance of explanations, both for performance and model robustness.
Computation and Language
What problem does this paper attempt to address?
### Problems the paper attempts to solve

The paper "Learning from Natural Language Explanations for Generalizable Entity Matching" addresses three challenges in entity matching:

1. **Lack of cross-domain generalization**:
   - Supervised entity-matching models trained on one domain often generalize poorly to new datasets. For example, a model trained on an electronics product dataset may handle a footwear dataset badly.
2. **High cost of labeled data**:
   - Collecting large-scale labeled data is expensive, especially when labels are needed across many different domains.
3. **High inference cost of large language models (LLMs)**:
   - Although LLMs perform well in zero-shot and few-shot settings, their inference cost is too high for large-scale practical use.

### Solutions

To address these problems, the paper recasts entity matching as a conditional generation task and uses natural-language explanations to improve the performance of small models. The steps are as follows:

1. **Conditional generation task**:
   - Treat entity matching as conditional generation rather than traditional binary classification. This lets the model produce richer outputs that include both a matching label and an explanation.
2. **Using large language models to generate explanations**:
   - Use large language models (such as Mistral-7B-Instruct and Alpaca) to generate natural-language explanations that capture the model's reasoning behind each entity-matching decision.
3. **Model distillation**:
   - Use the generated explanations to train smaller generative models (such as FlanT5-base) so that they can both perform entity matching and provide supporting explanations.

### Experimental results

1. **Performance improvement**:
   - In cross-domain, cross-schema, and cross-distribution tests, training on explanation-augmented data significantly improves model performance. In the cross-domain setting, the F-1 score increases by an average of 22.32%; in the cross-schema and cross-distribution settings, it increases by 14.47% and 13.67% respectively.
2. **Enhanced robustness**:
   - Human-intervention tests show that the model trained with explanations is more robust to slight input changes. Across 300 test instances, the model trained without explanations correctly flips only 23% of the labels, while the model trained with explanations reaches 54%.
3. **Accuracy of explanations**:
   - Human annotators evaluated the accuracy of the generated explanations. The results show that 10.9% of the explanations have internal errors and 15.1% contain information irrelevant to the input (i.e., "hallucinations").

### Conclusion

By recasting entity matching as a conditional generation task and using LLM-generated natural-language explanations to train small models, the paper addresses the lack of cross-domain generalization, the high cost of labeled data, and the high inference cost of large language models. The experimental results show that the method performs well across multiple test settings, which makes it practical for entity matching at scale.
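
The recasting of entity matching as conditional generation, described under Solutions, can be illustrated with a small data-formatting sketch. This is not the authors' code: the prompt wording, the `PairExample` type, and the helper names `serialize`, `to_seq2seq`, and `parse_label` are all hypothetical, and the example record and explanation text are invented for illustration. The sketch only shows how a labeled record pair plus an LLM-written explanation might be turned into a source/target text pair for a small sequence-to-sequence model, and how a binary decision could be read back from the generated text.

```python
from dataclasses import dataclass


@dataclass
class PairExample:
    """One training instance (hypothetical structure, not from the paper)."""
    left: dict          # attributes of the first record
    right: dict         # attributes of the second record
    is_match: bool      # gold label
    explanation: str    # rationale previously generated by a large LLM


def serialize(record: dict) -> str:
    """Flatten a record into 'attribute: value' text."""
    return ", ".join(f"{key}: {value}" for key, value in record.items())


def to_seq2seq(ex: PairExample, with_explanation: bool = True) -> tuple[str, str]:
    """Build the (source, target) strings for conditional generation.

    With with_explanation=False the target is the label only, which
    corresponds to the no-explanation ablation setting.
    """
    source = (
        "Do these two records refer to the same entity? "
        f"Record A: {serialize(ex.left)}. Record B: {serialize(ex.right)}."
    )
    label = "match" if ex.is_match else "no match"
    target = f"{label}. Explanation: {ex.explanation}" if with_explanation else label
    return source, target


def parse_label(generated: str) -> bool:
    """Recover the binary decision from generated text (assumed output format)."""
    return generated.strip().lower().startswith("match")


if __name__ == "__main__":
    example = PairExample(
        left={"title": "Sony WH-1000XM4", "category": "headphones"},
        right={"title": "Sony WH1000XM4 wireless headphones"},
        is_match=True,
        explanation="Both titles name the same Sony headphone model.",
    )
    src, tgt = to_seq2seq(example)
    print(src)
    print(tgt)
```

Putting the label before the explanation in the target string is a design choice of this sketch, made so that the decision can be parsed from the first tokens; the source summary does not specify the output layout.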