Small space and streaming pattern matching with k edits

Tomasz Kociumaka, Ely Porat, Tatiana Starikovskaya
DOI: https://doi.org/10.48550/arXiv.2106.06037
2021-06-11
Abstract: In this work, we revisit the fundamental and well-studied problem of approximate pattern matching under edit distance. Given an integer $k$, a pattern $P$ of length $m$, and a text $T$ of length $n \ge m$, the task is to find substrings of $T$ that are within edit distance $k$ from $P$. Our main result is a streaming algorithm that solves the problem in $\tilde{O}(k^5)$ space and $\tilde{O}(k^8)$ amortised time per character of the text, providing answers correct with high probability. (Hereafter, $\tilde{O}(\cdot)$ hides a $\mathrm{poly}(\log n)$ factor.) This answers a decade-old question: since the discovery of a $\mathrm{poly}(k\log n)$-space streaming algorithm for pattern matching under Hamming distance by Porat and Porat [FOCS 2009], the existence of an analogous result for edit distance remained open. Up to this work, no $\mathrm{poly}(k\log n)$-space algorithm was known even in the simpler semi-streaming model, where $T$ comes as a stream but $P$ is available for read-only access. In this model, we give a deterministic algorithm that achieves slightly better complexity. In order to develop the fully streaming algorithm, we introduce a new edit distance sketch parametrised by integers $n\ge k$. For any string of length at most $n$, the sketch is of size $\tilde{O}(k^2)$ and it can be computed with an $\tilde{O}(k^2)$-space streaming algorithm. Given the sketches of two strings, in $\tilde{O}(k^3)$ time we can compute their edit distance or certify that it is larger than $k$. This result improves upon $\tilde{O}(k^8)$-size sketches of Belazzougui and Zhu [FOCS 2016] and very recent $\tilde{O}(k^3)$-size sketches of Jin, Nelson, and Wu [STACS 2021].
Data Structures and Algorithms
What problem does this paper attempt to address?
The problem this paper addresses is approximate pattern matching under edit distance in the streaming model. Specifically, given an integer \(k\), a pattern \(P\) of length \(m\), and a text \(T\) of length \(n \geq m\), the task is to find the substrings of \(T\) whose edit distance from \(P\) is at most \(k\). The edit distance between two strings is the minimum number of character insertions, deletions, and substitutions needed to transform one into the other.

### Main contributions of the paper

1. **Streaming algorithm**:
   - The paper gives a streaming algorithm that solves the problem in \(\tilde{O}(k^{5})\) space and \(\tilde{O}(k^{8})\) amortized time per character of the text, with answers that are correct with high probability. Here, \(\tilde{O}(\cdot)\) hides a \(\text{poly}(\log n)\) factor.
   - This answers a decade-old question: ever since Porat and Porat [FOCS 2009] gave a \(\text{poly}(k\log n)\)-space streaming algorithm for pattern matching under Hamming distance, the existence of an analogous result for edit distance had remained open.
2. **Deterministic algorithm in the semi-streaming model**:
   - In the semi-streaming model, the text arrives as a stream but the pattern is available for read-only access. Before this work, no \(\text{poly}(k\log n)\)-space algorithm was known even in this simpler model. The paper gives a deterministic algorithm for it that uses \(\tilde{O}(k^{5})\) space and \(\tilde{O}(k^{6})\) amortized time per character.
3. **New edit-distance sketch**:
   - A new edit-distance sketch is designed for strings of length at most \(n\). The sketch has size \(\tilde{O}(k^{2})\) and can be computed by a streaming algorithm in \(\tilde{O}(k^{2})\) space. Given the sketches of two strings, \(\tilde{O}(k^{3})\) time suffices to compute their edit distance or to certify that it is larger than \(k\).
   - This result improves on the \(\tilde{O}(k^{8})\)-size sketches of Belazzougui and Zhu [FOCS 2016] and the \(\tilde{O}(k^{3})\)-size sketches of Jin, Nelson, and Wu [STACS 2021].

### Technical contributions

1. **Greedy encoding**:
   - A new space-efficient deterministic encoding, called Greedy Encoding, is introduced to encode all alignments of two strings that have cost at most \(k\).
   - This encoding makes it possible to compress substrings of the text that are close to the pattern. For strings of length at most \(n\), the encoding occupies \(\tilde{O}(k^{2})\) space.
2. **Edit-distance sketch**:
   - A new edit-distance sketch, parametrised by integers \(n\geq k\), is designed. For any string of length at most \(n\), the sketch has size \(\tilde{O}(k^{2})\) and can be computed by a streaming algorithm in \(\tilde{O}(k^{2})\) space.
   - Given the sketches of two strings, their edit distance can be computed, or certified to be larger than \(k\), in \(\tilde{O}(k^{3})\) time.

### Conclusion

By introducing a new encoding of low-cost alignments and a new edit-distance sketch, this paper gives the first \(\text{poly}(k\log n)\)-space streaming algorithm for approximate pattern matching under edit distance, and the sketch and encoding are tools that later work on related problems can reuse.
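
### Illustrative baseline

To make the problem statement concrete, here is a minimal sketch of two classical dynamic-programming baselines. It is not the paper's algorithm. `bounded_edit_distance` shows the "compute the distance or certify that it exceeds \(k\)" interface by filling only the cells within \(k\) of the main diagonal. `stream_matches` reads the text one character at a time and reports each end position where some substring is within edit distance \(k\) of the pattern. Both function names are assumptions chosen for illustration. Unlike the streaming algorithm summarised above, `stream_matches` stores the whole pattern and one column of \(m+1\) values, so its space is \(O(m)\) rather than \(\tilde{O}(k^{5})\).

```python
from typing import Iterable, Iterator, Optional, Tuple


def bounded_edit_distance(a: str, b: str, k: int) -> Optional[int]:
    """Return ed(a, b) if it is at most k, otherwise None.

    Classical banded DP (illustrative baseline, not the paper's sketch):
    any alignment of cost <= k stays within k of the main diagonal.
    """
    if abs(len(a) - len(b)) > k:
        return None
    inf = k + 1  # every value above k is treated as "too far"
    prev = [j if j <= k else inf for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [inf] * (len(b) + 1)
        if i <= k:
            cur[0] = i  # delete the first i characters of a
        lo, hi = max(1, i - k), min(len(b), i + k)
        for j in range(lo, hi + 1):
            substitute = prev[j - 1] + (a[i - 1] != b[j - 1])
            cur[j] = min(substitute, prev[j] + 1, cur[j - 1] + 1, inf)
        prev = cur
    return prev[len(b)] if prev[len(b)] <= k else None


def stream_matches(
    pattern: str, text: Iterable[str], k: int
) -> Iterator[Tuple[int, int]]:
    """Yield (end_position, distance) for every 0-indexed text position at
    which some substring ending there is within edit distance k of pattern.

    Classical column-by-column DP over the text stream; uses O(m) space.
    """
    m = len(pattern)
    # col[i] = min edit distance between pattern[:i] and a suffix of the
    # text read so far (the suffix may be empty).
    col = list(range(m + 1))
    for pos, ch in enumerate(text):
        new = [0] * (m + 1)  # the empty prefix matches the empty suffix
        for i in range(1, m + 1):
            new[i] = min(
                col[i - 1] + (pattern[i - 1] != ch),  # match or substitute
                col[i] + 1,                           # extra text character
                new[i - 1] + 1,                       # skipped pattern character
            )
        col = new
        if col[m] <= k:
            yield pos, col[m]
```

For example, `list(stream_matches("abc", "xxabcxx", 0))` reports the single exact occurrence ending at position 4, and `bounded_edit_distance("kitten", "sitting", 2)` returns `None` because the true distance is 3.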