Verification method for string similarity joins based on bi-directional filtering

Ying HUANG, Chunhua SONG, Baoning NIU
DOI: https://doi.org/10.3778/j.issn.1002-8331.1512-0309
2017-01-01
Abstract: A string similarity join finds similar string pairs from two sets of strings. It plays an important role in many real-world applications, and various algorithms have been proposed to address its efficiency issues. Partition-based filter-verification methods, such as Pass-Join, are promising: they quickly select possibly similar string pairs (the candidate set) by searching for the partitioned parts of one string in another string, processing strings in order of increasing length, and then verify similarity based on edit distance. Motivated by the observation that filtering in descending order of string length is more effective than filtering in ascending order, a novel bi-directional filtering-verification mechanism is proposed. At the filtering stage, it pipelines the results of length-descending filtering into length-ascending filtering to further reduce the size of the candidate set. At the verification stage, it uses the two pairs of matched substrings from the bi-directional filtering to partition each target string pair into several short substring pairs, which accelerates verification. Experimental results show that the proposed bi-directional filtering-verification algorithm outperforms the original algorithm on real-world datasets.
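The filter-verify idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`partition`, `passes_filter`, `edit_distance`, `similar`), the even partitioning scheme, and the edit-distance threshold parameter `tau` are assumptions made for illustration. The sketch relies on the pigeonhole argument behind partition-based filters: if two strings are within edit distance `tau`, then splitting either one into `tau + 1` segments leaves at least one segment that occurs unchanged in the other. Applying that test in both directions stands in for the bi-directional filter, and a plain dynamic-programming edit distance stands in for the paper's accelerated verification, which is not reproduced here.

```python
def partition(s, tau):
    """Split s into tau + 1 contiguous segments of near-equal length."""
    k = tau + 1
    base, extra = divmod(len(s), k)
    segments, start = [], 0
    for i in range(k):
        # The last `extra` segments each take one additional character.
        length = base + (1 if i >= k - extra else 0)
        segments.append(s[start:start + length])
        start += length
    return segments


def passes_filter(partitioned, probed, tau):
    """True if at least one segment of `partitioned` occurs in `probed`."""
    return any(seg in probed for seg in partition(partitioned, tau))


def edit_distance(a, b):
    """Standard two-row dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similar(r, s, tau):
    """Filter in both directions, then verify with edit distance."""
    if abs(len(r) - len(s)) > tau:       # length filter
        return False
    if not passes_filter(r, s, tau):     # partition r, probe s
        return False
    if not passes_filter(s, r, tau):     # partition s, probe r
        return False
    return edit_distance(r, s) <= tau    # verification
```

For example, with `tau = 1` the pair `("abcd", "xabyz")` passes the filter when `"abcd"` is partitioned (the segment `"ab"` occurs in `"xabyz"`) but fails when `"xabyz"` is partitioned, so the second direction prunes a candidate that the first direction alone would have passed on to verification.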