A Pattern-Based Entity Resolution Algorithm
Hui-Ping LIU, Che-Qing JIN, Ao-Ying ZHOU
DOI: https://doi.org/10.11897/SP.J.1016.2015.01796
2015-01-01
Chinese Journal of Computers
Abstract: As a critical step in data integration and data cleaning, entity resolution (ER) aims to identify groups of records that refer to the same real-world entity. There are currently two typical approaches. The first is exhaustive entity resolution, which compares all record pairs to determine the entity each record belongs to. However, its complexity of O(n^2), where n is the size of the dataset, is too high for large datasets. The second is blocking-based entity resolution, which maps similar records to the same block by a specific method (e.g., a hash function or a sliding window), so that only records in the same block need to be compared. This approach improves efficiency but sacrifices effectiveness, since records that refer to the same entity may not fall into the same block. In this paper we propose a pattern-based entity resolution method, which represents similar records by a record pattern and then derives a bound by comparing record patterns. With this bound, we can decide whether the records corresponding to two patterns need to be compared precisely to verify whether they refer to the same entity. In this way, we can dramatically accelerate entity resolution by filtering out dissimilar records while still ensuring correctness. Experiments on real and synthetic datasets demonstrate the efficiency and effectiveness of our method.
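The filter-then-verify idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how a record pattern is built or how the bound is computed, so everything below is an assumption chosen for illustration and not the authors' algorithm: records are treated as sets of lowercase tokens, the hypothetical "pattern" of a record is simply its number of distinct tokens, similarity is Jaccard, and the bound is the standard size-based upper bound min(|A|, |B|) / max(|A|, |B|) on Jaccard similarity. The names `pattern_of`, `pattern_upper_bound`, and `resolve` are illustrative.

```python
from collections import defaultdict
from itertools import combinations, product


def jaccard(a, b):
    """Exact Jaccard similarity between two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def pattern_of(tokens):
    # Assumption: the "pattern" is just the number of distinct tokens.
    # The paper's real pattern definition is not given in the abstract.
    return len(tokens)


def pattern_upper_bound(p, q):
    # For sets of sizes p and q, Jaccard can never exceed min/max,
    # so this is a safe upper bound for every record pair in the two groups.
    if p == 0 and q == 0:
        return 1.0
    return min(p, q) / max(p, q)


def resolve(records, threshold):
    """Return (matching index pairs, number of exact comparisons made).

    Assumes 0 < threshold <= 1.
    """
    token_sets = [frozenset(r.lower().split()) for r in records]

    # Group record indices by their pattern.
    groups = defaultdict(list)
    for idx, tokens in enumerate(token_sets):
        groups[pattern_of(tokens)].append(idx)

    matches = set()
    compared = 0
    patterns = sorted(groups)
    for i, p in enumerate(patterns):
        for q in patterns[i:]:
            # Patterns are sorted, so the bound only shrinks as q grows:
            # once it drops below the threshold, no later q can qualify.
            if pattern_upper_bound(p, q) < threshold:
                break
            if p == q:
                pairs = combinations(groups[p], 2)
            else:
                pairs = product(groups[p], groups[q])
            for a, b in pairs:
                compared += 1  # exact verification of a surviving pair
                if jaccard(token_sets[a], token_sets[b]) >= threshold:
                    matches.add((min(a, b), max(a, b)))
    return matches, compared


if __name__ == "__main__":
    sample = [
        "John Smith New York",
        "smith john new york",
        "J Smith",
        "Acme Corporation of North America Inc",
    ]
    print(resolve(sample, threshold=0.8))
```

Because the bound never underestimates the true similarity, skipping a pair of patterns cannot discard a true match, so under these assumptions the result equals that of an exhaustive comparison while fewer pairs are verified exactly. On the four sample records, only one of the six possible pairs is compared.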