Hierarchical filtering: improving similar substring matching under edit distance
Tao Qiu,Chuanyu Zong,Xiaochun Yang,Bin Wang,Bing Li
DOI: https://doi.org/10.1007/s11280-022-01128-w
2022-12-07
World Wide Web
Abstract:Similar substring matching, as an essential operation in applications including read mapping and text retrieval, has attracted significant attention in the research community. In this paper, we study the problem of similar substring matching with edit distance constraints. Existing methods generally utilize a filtering-and-verification framework to solve this problem – a filtering procedure is employed to reduce the searching space before going to a computationally expensive verification step, and the efficiency depends critically on balancing the cost of filtering and verification. The common filtering paradigm is based on the principle of Pigeonhole stating that a matching result must exactly match at least a certain number of substrings from the query, where the substrings act as a filter. However, the polynomial growth of filters caused by enlarging the number of substrings in filters, leading to the cost of filtering and verification is not well-balanced for the existing methods. To this end, we propose a novel filtering paradigm hierarchical filtering , aiming at achieving a fine-grained balance on the cost of filtering and verification. Unlike using a fixed number of substrings in a filter, our method allows the filters contain a different number of substrings that avoids the polynomial growth of filters. The filters are picked in accord with a scoring metric. We devise a tree-based filtering framework for hierarchical filtering. Also, the cost of filtering and verification is further reduced by eliminating the duplication of filters. Extensive experiments have been conducted on four real-world datasets by comparing to state-of-the-art methods Hobbes3, BWA, and BLAST, etc. The results show that our method outperforms the competing methods under a wide range of parameter settings.