Unsupervised Blocking and Probabilistic Parallelisation for Record Matching of Distributed Big Data

Chenxiao Dou, Yi Cui, Daniel Sun, Raymond Wong, Muhammad Atif, Guoqiang Li, Rajiv Ranjan
DOI: https://doi.org/10.1007/s11227-017-2008-8
2017-01-01
Abstract: Record matching refers to identifying pairs of records that relate to the same entities across different data sources. In many data mining applications, record matching has quadratic complexity, because every record may need to be compared with every other record. In practice, the number of non-matching record pairs far exceeds the number of matching pairs; this is called the imbalance problem. Blocking is a data reduction technique that filters out unlikely matching pairs before record matching. However, no fast and effective blocking algorithm exists yet for big data. In this paper, we report on using big data infrastructure to improve the efficiency of blocking. Our approach runs the blocking process independently, in a distributed manner, on partitions of the whole data set. To improve efficiency, we adopt a probabilistic technique that balances the speed and the effectiveness of our proposed distributed blocking algorithm. Our experimental analysis supports the superiority of our technique and demonstrates its scalability.
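To make the idea of blocking concrete, the following is a minimal sketch of generic key-based blocking. It is not the unsupervised or probabilistic method proposed in the paper. The record fields, the `blocking_key` function, and the three-letter prefix rule are all assumptions chosen for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical key (an assumption, not from the paper):
    # the first three letters of the lower-cased name.
    return record["name"].strip().lower()[:3]


def candidate_pairs(records):
    """Group records by blocking key and return only within-block id pairs."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec["id"])
    pairs = set()
    for ids in blocks.values():
        # Only records that share a block are ever compared.
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs


# Toy data, invented for this example.
records = [
    {"id": 1, "name": "Jonathan Smith"},
    {"id": 2, "name": "jonathan smyth"},
    {"id": 3, "name": "Maria Garcia"},
    {"id": 4, "name": "Maria Garcia-Lopez"},
    {"id": 5, "name": "Wei Zhang"},
]

all_pairs = len(records) * (len(records) - 1) // 2  # quadratic baseline
kept = candidate_pairs(records)
print(f"all pairs: {all_pairs}, pairs kept after blocking: {len(kept)}")
```

On this toy input, blocking keeps 2 of the 10 possible pairs, which illustrates how filtering unlikely matches reduces the quadratic comparison cost. A cruder key would drop true matches, so the choice of key trades speed against effectiveness.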