SAREM: Semi-supervised Active Heterogeneous Entity Matching Framework.

Jinxiu Du, Tiezheng Nie, Wenzhou Dou, Derong Shen, Yue Kou
DOI: https://doi.org/10.1007/978-3-031-20309-1_7
2022-01-01
Abstract: Entity matching is a key technique in data quality research: it identifies records in different data sources that refer to the same real-world entity. This paper introduces SAREM, a semi-supervised entity matching framework for heterogeneous data. We first obtain effective feature vectors using an embedding approach that combines semantic and relational information and can handle long sequences. Deep learning requires large amounts of labeled data, which are costly and time-consuming to obtain. We address this problem by using a dropout layer for data augmentation and by proposing an active learning method better suited to entity matching. We also address the classical challenges of deep active learning by reducing human intervention and improving model performance. Experiments on six public benchmark datasets show that our method outperforms DeepER and DeepMatcher on all of them. With a smaller amount of data, our method achieves effectiveness comparable to state-of-the-art (SOTA) entity matching methods, thereby reducing cost, and it outperforms SOTA entity matching methods on large datasets with long sequences.
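The abstract mentions using a dropout layer for data augmentation but does not give details. The following is only a minimal sketch of the general idea, assuming that a record is already represented as a plain list of floats and that augmentation means producing several randomly masked copies of that vector. The function names `dropout_view` and `augment`, the `rate` and `seed` parameters, and the use of inverted dropout are illustrative assumptions, not SAREM's actual implementation.

```python
import random


def dropout_view(embedding, rate=0.1, rng=None):
    """Return a noisy copy of `embedding` using inverted dropout (hypothetical helper)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    rng = rng or random.Random()
    keep = 1.0 - rate
    # Zero each component with probability `rate`; rescale the survivors
    # by 1 / keep so the expected value of each component is unchanged.
    return [x / keep if rng.random() < keep else 0.0 for x in embedding]


def augment(embedding, n_views=2, rate=0.1, seed=0):
    """Produce `n_views` dropout-perturbed copies of one embedding (assumed setup)."""
    rng = random.Random(seed)
    return [dropout_view(embedding, rate, rng) for _ in range(n_views)]


if __name__ == "__main__":
    record_embedding = [0.5, -1.2, 0.3, 0.9]  # made-up example vector
    for view in augment(record_embedding, n_views=2, rate=0.25):
        print(view)
```

Each view keeps the original dimensionality, so the copies could be fed to the same downstream matcher as additional training examples. In a deep learning framework the same effect would normally come from the model's own dropout layer in training mode rather than from a hand-written mask.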