Effective entity matching with transformers

Yuliang Li, Jinfeng Li, Yoshi Suhara, AnHai Doan, Wang-Chiew Tan
DOI: https://doi.org/10.1007/s00778-023-00779-z
2023-01-19
Abstract: We present Ditto, a novel entity matching (EM) system based on pre-trained Transformer language models. We fine-tune and cast EM as a sequence-pair classification problem to leverage such models with a simple architecture. Our experiments show that a straightforward application of language models such as BERT, DistilBERT, or RoBERTa pre-trained on large text corpora already significantly improves the matching quality and outperforms the previous state-of-the-art (SOTA) by up to 29% in F1 score on benchmark datasets. We also developed three optimization techniques to further improve Ditto's matching capability. Ditto allows domain knowledge to be injected by highlighting important pieces of input information that may be of interest when making matching decisions. Ditto also summarizes strings that are too long so that only the essential information is retained and used for EM. Finally, Ditto adapts a SOTA technique on data augmentation for text to EM to augment the training data with (difficult) examples. This way, Ditto is forced to learn "harder" to improve the model's matching capability. The optimizations we developed further boost the performance of Ditto by up to 9.8%. Perhaps more surprisingly, we establish that Ditto can achieve the previous SOTA results with at most half the amount of labeled data. Finally, we demonstrate Ditto's effectiveness on a real-world large-scale EM task. On matching two company datasets consisting of 789K and 412K records, Ditto achieves a high F1 score of 96.5%.
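To make the "sequence-pair classification" framing concrete, the following is a minimal sketch of how two entity records might be flattened into a pair of text sequences before being handed to a fine-tuned language model. The abstract does not specify the serialization format, so the `[COL]`/`[VAL]` markers, the function names `serialize` and `to_sequence_pair`, and the sample records are all assumptions made for illustration.

```python
from typing import Dict, Tuple


def serialize(record: Dict[str, str]) -> str:
    """Flatten one record into a single string.

    The [COL]/[VAL] markers are an assumed convention for this sketch:
    each attribute name and value is tagged so that the model can tell
    where one field ends and the next begins.
    """
    return " ".join(f"[COL] {name} [VAL] {value}" for name, value in record.items())


def to_sequence_pair(left: Dict[str, str], right: Dict[str, str]) -> Tuple[str, str]:
    """Turn a candidate pair of records into a (text_a, text_b) sequence pair."""
    return serialize(left), serialize(right)


# Hypothetical records describing what may be the same product.
left = {"title": "instant immersion spanish deluxe 2.0", "price": "49.99"}
right = {"title": "instant immers spanish dlux 2", "price": "36.11"}

pair = to_sequence_pair(left, right)
print(pair[0])
print(pair[1])
```

A pair produced this way could then be passed to any tokenizer that accepts two text segments, with a binary match/no-match label used as the fine-tuning target. The model choice and training loop are left out here because the abstract gives no details on them.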
computer science, information systems, hardware & architecture