Improved LSH-driven String Similarity Join Filtering-Verification Framework

Jingwei Zhang, Ru Chen, Qing Yang
DOI: https://doi.org/10.1504/ijiitc.2020.10032171
2020-01-01
International Journal of Intelligent Internet of Things Computing
Abstract: Similarity join is a basic data analysis operation that is widely used in similarity search, data cleaning and recommendation applications. The filtering-verification framework is one of the main approaches to implementing similarity join. To handle high-dimensional data and high edit distance thresholds, a filtering-verification framework based on locality-sensitive hashing (LSH) is proposed. It adopts a dual filtering mode to balance the numbers of false positives and false negatives, thereby improving the efficiency and accuracy of similarity join. Experimental results show that the LSH-based similarity join filtering-verification framework effectively reduces the number of false positives and is significantly more efficient than the traditional method based on edit distance.
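To make the filtering-verification idea concrete, the following is a minimal Python sketch of a generic filter-then-verify string similarity join. It is not the authors' implementation: the abstract does not specify the hash family or the two filters, so this sketch assumes MinHash over q-grams with banding as the LSH filter and a simple length filter as the second filter, followed by exact edit distance verification. All function names and parameters (`qgrams`, `minhash_signature`, `similarity_join`, `num_hashes`, `bands`, `q`) are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def qgrams(s, q=2):
    """Return the set of q-grams of s (the whole string if it is shorter than q)."""
    if len(s) < q:
        return {s}
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def _hash(seed, gram):
    """Deterministic 64-bit hash of a q-gram under a given seed (assumed hash family)."""
    digest = hashlib.md5(f"{seed}:{gram}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(s, num_hashes=32, q=2):
    """MinHash signature: for each seed, the minimum hash over the string's q-grams."""
    grams = qgrams(s, q)
    return tuple(min(_hash(seed, g) for g in grams) for seed in range(num_hashes))


def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def similarity_join(strings, tau, num_hashes=32, bands=16, q=2):
    """Return index pairs (i, j), i < j, whose edit distance is at most tau.

    Filter 1 (assumed): LSH banding over MinHash signatures.
    Filter 2 (assumed): length filter, since edit distance >= length difference.
    Verification: exact edit distance on the surviving candidates.
    """
    rows = num_hashes // bands
    buckets = defaultdict(list)
    for idx, s in enumerate(strings):
        sig = minhash_signature(s, num_hashes, q)
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].append(idx)

    # Any two strings sharing at least one bucket become a candidate pair.
    candidates = set()
    for members in buckets.values():
        candidates.update(combinations(members, 2))

    results = []
    for i, j in sorted(candidates):
        if abs(len(strings[i]) - len(strings[j])) > tau:
            continue  # pruned by the length filter
        if edit_distance(strings[i], strings[j]) <= tau:
            results.append((i, j))
    return results
```

In this sketch the verification step removes every false positive that survives the filters, while false negatives can still occur when a truly similar pair shares no LSH bucket. Using more bands with fewer rows per band makes collisions more likely, which trades extra candidates for fewer missed pairs.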