Metric Similarity Joins Using MapReduce (Extended Abstract)

Gang Chen, Keyu Yang, Lu Chen, Yunjun Gao, Baihua Zheng, Chun Chen
DOI: https://doi.org/10.1109/ICDE.2018.00251
2018-01-01
Abstract: Given two object sets Q and O, a metric similarity join finds the object pairs that are similar according to a given criterion. This operator has a wide range of applications, including data cleaning and data mining. In this paper, we employ a popular distributed framework, MapReduce, to support scalable metric similarity joins. To ensure load balancing, we present two sampling-based partition methods: a clustering-based partition method and a KD-tree-based partition method. To avoid unnecessary object pair evaluations, we propose a framework that maps the two object sets in order, using plane sweeping and pivot-based filtering for pruning. Extensive experiments confirm that our solution significantly outperforms existing state-of-the-art competitors.
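The abstract does not spell out how pivot-based filtering works, so the following is only a minimal single-machine sketch of the general idea, not the paper's MapReduce algorithm. It assumes the similarity criterion is a distance threshold `eps`, uses Euclidean distance as the metric, and takes the pivots as given; the names `pivot_filtered_join`, `pivots`, and `eps` are hypothetical. By the triangle inequality, |d(q, p) - d(o, p)| <= d(q, o) for any pivot p, so a pair whose pivot distances differ by more than `eps` cannot qualify and can be skipped without computing d(q, o).

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]


def euclidean(a: Point, b: Point) -> float:
    """Euclidean distance, used here as an example metric."""
    return math.dist(a, b)


def pivot_filtered_join(
    Q: Sequence[Point],
    O: Sequence[Point],
    pivots: Sequence[Point],
    eps: float,
    dist: Callable[[Point, Point], float] = euclidean,
) -> Tuple[List[Tuple[int, int]], int]:
    """Return index pairs (i, j) with dist(Q[i], O[j]) <= eps,
    plus the number of pairs that needed a full distance computation."""
    # Precompute each object's distances to the pivots (its "pivot mapping").
    q_map = [[dist(q, p) for p in pivots] for q in Q]
    o_map = [[dist(o, p) for p in pivots] for o in O]

    result: List[Tuple[int, int]] = []
    verified = 0
    for i, q in enumerate(Q):
        for j, o in enumerate(O):
            # Triangle inequality: |d(q,p) - d(o,p)| is a lower bound on d(q,o).
            # If the bound exceeds eps for any pivot, the pair cannot match.
            if any(abs(dq - do) > eps for dq, do in zip(q_map[i], o_map[j])):
                continue
            verified += 1
            if dist(q, o) <= eps:
                result.append((i, j))
    return result, verified


def brute_force_join(Q, O, eps, dist=euclidean):
    """Reference join without any pruning, for checking correctness."""
    return [(i, j) for i, q in enumerate(Q) for j, o in enumerate(O)
            if dist(q, o) <= eps]
```

Because the filter relies only on a lower bound, it never discards a qualifying pair; it only reduces how many pairs need an exact distance computation. This sketch still loops over all pairs and omits the partitioning and plane sweeping mentioned in the abstract.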