Scalable Metric Similarity Join Using MapReduce

Jiacheng Wu, Yong Zhang, Jin Wang, Chunbin Lin, Yingjia Fu, Chunxiao Xing
DOI: https://doi.org/10.1109/icde.2019.00167
2019-01-01
Abstract: Given two collections of objects, a metric similarity join finds all pairs of similar objects according to a particular distance function in a metric space. There is an increasing demand for scalable similarity join algorithms that can support efficient query and analytical services in the era of Big Data. In this paper, we propose SMS-Join, a parallel framework that supports similarity join in metric space based on the MapReduce paradigm. SMS-Join first selects some records as pivots in a preprocessing phase, then splits the data into partitions based on these pivots with a map job, and finally obtains the join results via a reduce job. To ensure load balancing between the partitions, we devise a lightweight sampling technique that obtains high-quality samples while maintaining high performance. To reduce the partition cost, we develop an iterative partition strategy in the map phase. We implement our framework on the Apache Spark platform and conduct extensive experiments on four real-world datasets. The results show that our method significantly outperforms state-of-the-art methods.
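The pivot, partition, and verify workflow described in the abstract can be illustrated with a minimal single-machine sketch in Python. This is not the SMS-Join algorithm itself: the abstract does not give its details, so the sketch substitutes a generic pivot-based scheme with random pivots and triangle-inequality pruning, and it omits the sampling and iterative partitioning techniques entirely. The names `pivot_join`, `euclidean`, `num_pivots`, and `delta` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import product
import math
import random


def euclidean(a, b):
    """Euclidean distance, used here as an example metric."""
    return math.dist(a, b)


def pivot_join(R, S, delta, num_pivots=3, dist=euclidean, seed=0):
    """Return all pairs (r, s) with dist(r, s) <= delta.

    Assumption: pivots are drawn uniformly at random from R. The paper
    selects pivots with its own sampling technique, not shown here.
    """
    rng = random.Random(seed)
    pivots = rng.sample(R, min(num_pivots, len(R)))

    # "Map" step for R: send each record to its closest pivot and track
    # the largest pivot distance (radius) seen in each partition.
    r_parts = defaultdict(list)
    radius = defaultdict(float)
    for r in R:
        dists = [dist(r, p) for p in pivots]
        i = min(range(len(pivots)), key=dists.__getitem__)
        r_parts[i].append(r)
        radius[i] = max(radius[i], dists[i])

    # "Map" step for S: copy s into every partition that might hold a match.
    # By the triangle inequality, dist(s, pivot) > radius + delta implies
    # dist(s, r) > delta for every r in that partition, so it is skipped.
    s_parts = defaultdict(list)
    for s in S:
        for i in r_parts:
            if dist(s, pivots[i]) <= radius[i] + delta:
                s_parts[i].append(s)

    # "Reduce" step: verify candidate pairs inside each partition.
    results = []
    for i, rs in r_parts.items():
        for r, s in product(rs, s_parts[i]):
            if dist(r, s) <= delta:
                results.append((r, s))
    return results
```

Calling `pivot_join(R, S, 0.1)` on two lists of coordinate tuples returns the same pairs as a brute-force nested loop, but only compares records that share a partition. Because each record of R belongs to exactly one partition, no result pair is reported twice. In a real MapReduce or Spark job, the partition index would serve as the shuffle key.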