Preference-driven Similarity Join

Chuancong Gao, Jiannan Wang, Jian Pei, Rui Li, Yi Chang
DOI: https://doi.org/10.1145/3106426.3106484
2017-01-01
Abstract: Similarity join, which finds similar objects (e.g., products, names, addresses) across different sources, is a powerful tool for dealing with variety in big data, especially web data. Threshold-driven similarity join, which has been studied extensively, assumes that a user can specify a similarity threshold and then focuses on efficiently returning the object pairs whose similarities pass that threshold. We argue that the assumption of a well-set similarity threshold may not be valid, for two reasons. First, the optimal thresholds for different similarity join tasks may vary considerably. Second, the end-to-end time spent on similarity join is likely to be dominated by a back-and-forth threshold-tuning process.
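To make the threshold-driven setting described in the abstract concrete, the following is a minimal sketch of a naive threshold-driven similarity join. It is an illustration only, not the method proposed in the paper: the choice of Jaccard similarity over word tokens, the function names `jaccard` and `threshold_join`, and the sample records in `LEFT` and `RIGHT` are all assumptions introduced here.

```python
from itertools import product


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercase whitespace-separated tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty strings are treated as identical
    return len(ta & tb) / len(ta | tb)


def threshold_join(left, right, threshold):
    """Return (l, r, sim) for every cross-source pair with sim >= threshold.

    Naive nested-loop join: compares every pair, so it is quadratic in
    the input sizes and only meant to show the semantics.
    """
    result = []
    for l, r in product(left, right):
        sim = jaccard(l, r)
        if sim >= threshold:
            result.append((l, r, sim))
    return result


# Hypothetical product names from two sources.
LEFT = ["apple iphone 7 32gb", "samsung galaxy s8"]
RIGHT = ["iphone 7 32gb black", "galaxy s8 samsung phone", "google pixel"]

if __name__ == "__main__":
    # The result set changes with the threshold, which is why the
    # abstract describes tuning it as a back-and-forth process.
    for t in (0.5, 0.7, 0.8):
        print(t, threshold_join(LEFT, RIGHT, t))
```

On this toy input, the number of returned pairs depends entirely on the chosen threshold, so a user who does not know a good value in advance would have to rerun the join several times.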