Efficient Duplicate Record Detection Based on Similarity Estimation

Mohan Li, Hongzhi Wang, Jianzhong Li, Hong Gao
DOI: https://doi.org/10.1007/978-3-642-14246-8_58
2010-01-01
Abstract: In information integration systems, duplicate records cause problems in data processing and analysis. To represent the similarity between two records from different data sources with different schemas, optimal bipartite graph matching is applied to their attributes, and the similarity is measured as the weight of that matching. However, this intuitive method has two shortcomings. In terms of efficiency, it must compare all records pairwise. In terms of effectiveness, a strict condition for judging duplicate records results in low recall. To make the method work in practice, this paper presents an efficient method based on similarity estimation. The basic idea is to estimate the range of the similarity between two records in O(1) time and to determine whether they are duplicates according to that estimate. Theoretical analysis and experimental results show that the method is effective and efficient.
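As an illustration of the matching-based similarity described in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes that records are lists of attribute strings, that attribute similarity is a generic string ratio from Python's difflib, and that the matching weight is normalized by the larger attribute count; the names value_similarity and record_similarity are hypothetical. The optimal matching is found by brute force, which is only practical for records with a handful of attributes. The O(1) range estimation is not sketched, because the abstract does not specify how it is computed.

```python
from difflib import SequenceMatcher
from itertools import permutations


def value_similarity(a: str, b: str) -> float:
    """Similarity of two attribute values in [0, 1] (assumed measure, case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r1: list[str], r2: list[str]) -> float:
    """Weight of the best one-to-one attribute matching, normalized to [0, 1].

    Attributes are matched without relying on schema order, so records from
    sources with different schemas can still be compared. Normalizing by the
    larger attribute count is an assumption made for this sketch.
    """
    short, long_ = sorted((list(r1), list(r2)), key=len)
    if not long_:
        return 0.0  # two empty records: no evidence of similarity

    best = 0.0
    # Try every injective assignment of the shorter record's attributes
    # to attributes of the longer record and keep the heaviest one.
    for perm in permutations(range(len(long_)), len(short)):
        total = sum(value_similarity(short[i], long_[j]) for i, j in enumerate(perm))
        best = max(best, total)
    return best / len(long_)


if __name__ == "__main__":
    a = ["John Smith", "Boston", "1980"]
    b = ["1980", "john smith", "boston"]
    print(record_similarity(a, b))
```

Comparing every pair of records this way is exactly the pairwise cost the abstract identifies as the efficiency problem, which is why a cheap estimate of the similarity range is attractive as a filter before any exact matching is computed.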