Large-Scale Similarity Join With Edit-Distance Constraints

Chen Lin, Haiyang Yu, Wei Weng, Xianmang He
DOI: https://doi.org/10.1007/978-3-319-05813-9_22
2014-01-01
Abstract: In the age of big data, the data quality problem is more severe than ever. As an essential step in data cleaning, similarity join has attracted considerable attention from the database community. In this work, to address the similarity join problem with edit-distance constraints, we first improve the partition-based join algorithm for small-scale data. We then extend the algorithm to the Map-Reduce framework for large-scale data. Extensive experiments on both real and simulated datasets demonstrate the efficiency of our algorithms.
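To make the problem statement concrete: a similarity join with an edit-distance constraint returns every pair of strings whose edit distance is at most a threshold. The sketch below is a naive quadratic baseline and not the partition-based or Map-Reduce algorithm of the paper. The names `edit_distance`, `naive_similarity_join`, and `tau` are illustrative assumptions, not taken from the source.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string along the row
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def naive_similarity_join(strings, tau):
    """Return all pairs (s, t) with edit_distance(s, t) <= tau.

    Baseline only: it compares every pair, skipping those whose
    length difference already exceeds tau (a simple length filter).
    """
    result = []
    for s, t in combinations(strings, 2):
        if abs(len(s) - len(t)) > tau:
            continue  # lengths differ by more than tau, cannot match
        if edit_distance(s, t) <= tau:
            result.append((s, t))
    return result


pairs = naive_similarity_join(["vldb", "pvldb", "icde"], tau=1)
```

The all-pairs loop is what makes this baseline impractical at scale, which is the cost that partition-based and distributed approaches aim to avoid.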