Abstract: Similar substring matching, an essential operation in applications such as read mapping and text retrieval, has attracted significant attention in the research community. In this paper, we study the problem of similar substring matching with edit distance constraints. Existing methods generally use a filtering-and-verification framework to solve this problem: a filtering procedure reduces the search space before a computationally expensive verification step, and efficiency depends critically on balancing the cost of filtering against the cost of verification. The common filtering paradigm is based on the pigeonhole principle, which states that a matching result must exactly match at least a certain number of substrings from the query, where the substrings act as a filter. However, enlarging the number of substrings in a filter causes polynomial growth in the number of filters, so existing methods do not balance the cost of filtering and verification well. To this end, we propose a novel filtering paradigm, hierarchical filtering, which aims to achieve a fine-grained balance between the cost of filtering and verification. Rather than using a fixed number of substrings in each filter, our method allows filters to contain different numbers of substrings, which avoids the polynomial growth of filters. The filters are selected according to a scoring metric. We devise a tree-based filtering framework for hierarchical filtering. The cost of filtering and verification is further reduced by eliminating duplicate filters. Extensive experiments have been conducted on four real-world datasets, comparing against state-of-the-art methods including Hobbes3, BWA, and BLAST. The results show that our method outperforms the competing methods under a wide range of parameter settings.
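To make the filtering-and-verification framework concrete, the following is a minimal sketch of the classic pigeonhole baseline that the abstract describes, not the hierarchical filtering method itself. It assumes the simplest form of the principle: if the query is split into tau + 1 disjoint segments, any text substring within edit distance tau of the query must contain at least one segment exactly. The function names (`partition`, `best_substring_distance`, `pigeonhole_search`) and the window-based verification are illustrative assumptions, not taken from the paper.

```python
def partition(query, tau):
    """Split query into tau + 1 contiguous, non-overlapping segments.

    Returns a list of (offset_in_query, segment) pairs.
    """
    k = tau + 1
    if len(query) < k:
        raise ValueError("query must have at least tau + 1 characters")
    base, extra = divmod(len(query), k)
    segments, start = [], 0
    for i in range(k):
        length = base + (1 if i < extra else 0)
        segments.append((start, query[start:start + length]))
        start += length
    return segments


def best_substring_distance(query, window):
    """Minimum edit distance between query and any substring of window.

    Standard semi-global dynamic programming: the first row is all zeros,
    so a match may start anywhere in window, and the minimum of the last
    row lets it end anywhere.
    """
    prev = [0] * (len(window) + 1)
    for i, qc in enumerate(query, 1):
        cur = [i]
        for j, wc in enumerate(window, 1):
            cur.append(min(prev[j] + 1,                # delete qc
                           cur[j - 1] + 1,             # insert wc
                           prev[j - 1] + (qc != wc)))  # match or substitute
        prev = cur
    return min(prev)


def pigeonhole_search(text, query, tau):
    """Filter with exact segment matches, then verify candidate windows.

    Returns a sorted list of (window_start, window_end, distance) for each
    candidate window of text containing a substring within tau edits.
    """
    results = set()
    for offset, seg in partition(query, tau):
        pos = text.find(seg)
        while pos != -1:
            # Filtering: an exact segment hit at pos implies any match
            # lies inside this window (shifted by at most tau).
            lo = max(0, pos - offset - tau)
            hi = min(len(text), pos - offset + len(query) + tau)
            # Verification: the expensive edit-distance computation.
            dist = best_substring_distance(query, text[lo:hi])
            if dist <= tau:
                results.add((lo, hi, dist))
            pos = text.find(seg, pos + 1)
    return sorted(results)
```

In this sketch every exact segment hit triggers a verification, so short or frequent segments generate many candidates. That is the filtering-versus-verification trade-off the abstract refers to when it motivates filters with a varying number of substrings.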

Verification method for string similarity joins based on bi-directional filtering

Improved LSH-driven String Similarity Join Filtering-Verification Framework

A Partition-Based Bi-directional Filtering Method for String Similarity JOINs.

Efficient and Scalable Processing of String Similarity Join

A Partition-Based Method for String Similarity Joins with Edit-Distance Constraints

PASS-JOIN: A Partition-based Method for Similarity Joins

String Similarity Joins

Set Similarity Join Using Partition Index

Gfsf: A Novel Similarity Join Method Based On Frequency Vector

A Pivotal Prefix Based Filtering Algorithm for String Similarity Search

Pass-Join-K: Similarity Join Method Based on Multi-Match Partition

Trie-join: a Trie-Based Method for Efficient String Similarity Joins

LS-Join: Local Similarity Join on String Collections (extended Abstract).

Hierarchical filtering: improving similar substring matching under edit distance

Continuous similarity join on data streams

Efficient String Similarity Search: A Cross Pivotal Based Approach.

String similarity search and join: a survey

FrepJoin: an Efficient Partition-Based Algorithm for Edit Similarity Join

Hash(Ed)-Join: Approximate String Similarity Join With Hashing

Massjoin: A Mapreduce-Based Method for Scalable String Similarity Joins

Can we beat the prefix filtering?: an adaptive framework for similarity join and search.