A Partition-Based Bi-directional Filtering Method for String Similarity JOINs.

Ying Huang, Baoning Niu, Chunhua Song
DOI: https://doi.org/10.1007/978-3-319-21042-1_32
2015-01-01
Abstract: A string similarity join finds similar string pairs from two sets of strings, an operation that arises in many applications, such as duplicate detection, data integration, and data cleaning. Various algorithms have been proposed to address its efficiency issues. Partition-based filtering methods, such as Pass-JOIN, are promising: they quickly screen for possibly similar string pairs by searching for the partitioned parts of one string in another string, in order of increasing length, and then perform similarity verification based on edit distance. We notice that filtering in different directions produces different candidate sets, which motivates us to use a bi-directional filtering mechanism. This paper proposes a novel bi-directional filtering mechanism that enhances the filtering capability by pipelining the results filtered in the forward direction into the backward filtering process. The substring selection method of Pass-JOIN is adapted for the backward filtering. Experimental results show that the proposed bi-directional filtering algorithm outperforms the original algorithm on real-world datasets.
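The partition-based filtering idea mentioned in the abstract can be illustrated with a minimal Python sketch. It rests on the pigeonhole principle: if two strings are within edit distance tau, and one of them is split into tau + 1 segments, at least one segment must occur unchanged as a substring of the other. The sketch is a simplified, one-directional illustration only; it is not the paper's bi-directional algorithm and it omits Pass-JOIN's inverted index and substring selection. The names `partition`, `passes_filter`, and `similarity_join` are hypothetical and chosen for this example.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]


def partition(s: str, tau: int) -> List[str]:
    """Split s into tau + 1 contiguous segments of near-equal length.

    Assumption: an even partition in which the trailing segments absorb
    the remainder; any partition into tau + 1 segments keeps the filter sound.
    """
    k = tau + 1
    base, extra = divmod(len(s), k)
    segments, start = [], 0
    for i in range(k):
        length = base + (1 if i >= k - extra else 0)
        segments.append(s[start:start + length])
        start += length
    return segments


def passes_filter(r: str, s: str, tau: int) -> bool:
    """Keep the pair (r, s) only if it could still be within distance tau."""
    if abs(len(r) - len(s)) > tau:
        return False  # length filter: too far apart to be similar
    # Pigeonhole filter: some segment of r must appear verbatim in s.
    return any(seg in s for seg in partition(r, tau))


def similarity_join(left: List[str], right: List[str], tau: int) -> List[Tuple[str, str]]:
    """Naive nested-loop join: filter first, then verify with edit distance."""
    return [(r, s)
            for r in left
            for s in right
            if passes_filter(r, s, tau) and edit_distance(r, s) <= tau]


if __name__ == "__main__":
    print(similarity_join(["vldb", "sigmod"], ["pvldb", "icde"], tau=1))
```

The filter never discards a truly similar pair, but it may let through pairs that fail verification, which is why the edit-distance check follows it. Swapping the roles of `r` and `s` in `passes_filter` partitions the other string instead and can admit a different set of candidates, which is consistent with the abstract's observation that the filtering direction affects the candidate set.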