Extending String Similarity Join to Tolerant Fuzzy Token Matching

Jiannan Wang, Guoliang Li, Jianhua Feng
DOI: https://doi.org/10.1145/2535628
IF: 1.6289
2014-01-01
ACM Transactions on Database Systems
Abstract: String similarity join, which finds similar string pairs between two string sets, is an essential operation in many applications and has attracted significant attention recently in the database community. A significant challenge in similarity join is implementing an effective fuzzy match operation that finds all similar string pairs, including those that do not match exactly. In this article, we propose a new similarity function, called fuzzy-token-matching-based similarity, which extends token-based similarity functions (e.g., Jaccard similarity and cosine similarity) by allowing fuzzy matches between tokens. We study the problem of similarity join using this new similarity function and present a signature-based method to address it. We propose new signature schemes and develop effective pruning techniques to improve performance. We also extend our techniques to support weighted tokens. Experimental results show that our method achieves high efficiency and result quality and significantly outperforms state-of-the-art approaches.
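To make the idea of fuzzy matching between tokens concrete, the following is a minimal sketch of a Jaccard-style similarity in which two tokens may match even when they are not identical. The abstract does not give the exact definition, so several choices here are assumptions made for illustration: tokens are compared by normalized edit similarity, a hypothetical threshold `delta` decides whether two tokens may match, and tokens are paired by a simple greedy one-to-one matching rather than an optimal matching. All function names are hypothetical and are not taken from the article.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def edit_similarity(a, b):
    """Edit distance normalized into [0, 1]; 1.0 means identical tokens."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def fuzzy_overlap(tokens_a, tokens_b, delta=0.8):
    """Sum of similarities over a greedy one-to-one token matching.

    Assumption: greedy pairing (highest similarity first) stands in for
    an optimal matching; only pairs with similarity >= delta may match.
    """
    candidates = sorted(
        ((edit_similarity(x, y), i, j)
         for i, x in enumerate(tokens_a)
         for j, y in enumerate(tokens_b)),
        reverse=True,
    )
    used_a, used_b = set(), set()
    total = 0.0
    for sim, i, j in candidates:
        if sim < delta:
            break  # remaining pairs are all below the threshold
        if i in used_a or j in used_b:
            continue  # each token takes part in at most one match
        used_a.add(i)
        used_b.add(j)
        total += sim
    return total


def fuzzy_jaccard(s1, s2, delta=0.8):
    """Jaccard-style ratio with the exact overlap replaced by fuzzy overlap."""
    a, b = s1.lower().split(), s2.lower().split()
    if not a and not b:
        return 1.0
    overlap = fuzzy_overlap(a, b, delta)
    return overlap / (len(a) + len(b) - overlap)


if __name__ == "__main__":
    # With delta=1.0 only identical tokens match, which reduces to exact Jaccard.
    print(fuzzy_jaccard("nba mcgrady", "nba macgrady", delta=1.0))
    # With delta=0.8 the misspelled token pair is allowed to match.
    print(fuzzy_jaccard("nba mcgrady", "nba macgrady", delta=0.8))
```

In this sketch, setting `delta` to 1.0 recovers ordinary Jaccard similarity on whitespace tokens, while a lower threshold lets near-identical tokens such as a misspelled name contribute to the overlap. The signature schemes and pruning techniques mentioned in the abstract are not modeled here.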