Active Learning with Density-Initialized Decision Tree for Record Matching

Chenxiao Dou, Daniel Sun, Guoqiang Li, Raymond K. Wong
DOI: https://doi.org/10.1145/3085504.3085518
2017-01-01
Abstract: One of the fundamental problems in data management and data integration is record matching, which refers to identifying records that relate to the same entities across different data sources. In recent literature, active learning has been shown to be effective for record matching. One of the key steps of active learning is to build a proper initial classifier, with which active learning algorithms can quickly locate informative examples for training accurate models. However, labelling examples for model training is usually expensive. Even worse, a weak initial classifier can significantly increase the labelling cost. In this paper, we propose an unsupervised algorithm to determine the initial classifier, so that classifier initialization incurs no labelling cost. Building on the proposed algorithm, we then present an active sampling method for selecting informative examples. Experiments show that our approach achieves competitive learning performance with much lower labelling cost than other active learning approaches.
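To make the workflow in the abstract concrete, the following is a minimal sketch of an active learning loop whose initial classifier is built without any human labels. It is not the paper's algorithm: the density-based initialization is replaced here by a simple stand-in heuristic (treat the most and least similar record pairs as presumed matches and non-matches), the decision tree is reduced to a depth-one stump over a single similarity score, and the sampling rule is plain uncertainty sampling. The names `seed_labels`, `fit_stump`, `active_learn`, `oracle`, and the boundary value 0.605 are all hypothetical.

```python
def seed_labels(scores, k):
    """Unsupervised seeding (stand-in heuristic, not the paper's density method):
    presume the k lowest-similarity pairs are non-matches and the k highest are matches."""
    ordered = sorted(scores)
    return [(s, False) for s in ordered[:k]] + [(s, True) for s in ordered[-k:]]


def fit_stump(labeled):
    """Fit a depth-one decision tree: pick the threshold t that minimizes
    training errors for the rule 'score >= t means match'."""
    values = sorted(s for s, _ in labeled)
    # Candidate thresholds are midpoints between consecutive labeled scores.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(t):
        return sum((s >= t) != y for s, y in labeled)

    return min(candidates, key=errors)


def active_learn(scores, oracle, k=3, budget=10):
    """Seed without labels, then repeatedly query the pair the stump is least sure about."""
    labeled = seed_labels(scores, k)
    seeded = {s for s, _ in labeled}
    pool = [s for s in scores if s not in seeded]
    threshold = fit_stump(labeled)
    queries = 0
    while pool and queries < budget:
        # Uncertainty sampling: the pair closest to the decision boundary.
        query = min(pool, key=lambda s: abs(s - threshold))
        pool.remove(query)
        labeled.append((query, oracle(query)))  # the only step that costs a human label
        queries += 1
        threshold = fit_stump(labeled)
    return threshold, queries


# Toy data: one similarity score per candidate record pair (assumed, not from the paper).
scores = [i / 100 for i in range(100)]


def oracle(score):
    """Hypothetical human labeler: pairs scoring at least 0.605 are true matches."""
    return score >= 0.605


threshold, queries = active_learn(scores, oracle, k=3, budget=10)
accuracy = sum((s >= threshold) == oracle(s) for s in scores) / len(scores)
print(f"threshold={threshold:.3f} queries={queries} accuracy={accuracy:.2f}")
```

In this toy setting the seed costs nothing to label, and each query roughly narrows the interval in which the decision boundary can lie, which is the intuition behind starting active learning from a reasonable initial classifier.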