On pattern matching with k mismatches and few don't cares

Marius Nicolae, Sanguthevar Rajasekaran
DOI: https://doi.org/10.1016/j.ipl.2016.10.003
2016-10-29
Abstract: We consider the problem of pattern matching with $k$ mismatches, where there can be don't care or wild card characters in the pattern. Specifically, given a pattern $P$ of length $m$ and a text $T$ of length $n$, we want to find all occurrences of $P$ in $T$ that have no more than $k$ mismatches. The pattern can have don't care characters, which match any character. Without don't cares, the best known algorithm for pattern matching with $k$ mismatches has a runtime of $O(n\sqrt{k \log k})$. With don't cares in the pattern, the best deterministic algorithm has a runtime of $O(nk\,\text{polylog}\, m)$. Therefore, there is an important gap between the versions with and without don't cares. In this paper we give an algorithm whose runtime increases with the number of don't cares. We define an "island" to be a maximal length substring of $P$ that does not contain don't cares. Let $q$ be the number of islands in $P$. We present an algorithm that runs in $O(n\sqrt{k\log m}+n\min\{\sqrt[3]{qk\log^2 m},\sqrt{q\log m}\})$ time. If the number of islands $q$ is $O(k)$ this runtime becomes $O(n\sqrt{k\log m})$, which essentially matches the best known runtime for pattern matching with $k$ mismatches without don't cares. If the number of islands $q$ is $O(k^2)$, this algorithm is asymptotically faster than the previous best algorithm for pattern matching with $k$ mismatches with don't cares in the pattern.
Data Structures and Algorithms
What problem does this paper attempt to address?
This paper addresses pattern matching with at most k mismatches when the pattern may contain wildcard ("don't care") characters. Given a pattern P of length m and a text T of length n, the goal is to find every position i in T such that the Hamming distance between P and the length-m substring \(T_i\) of T starting at position i is at most k. The Hamming distance is the number of positions at which two equal-length strings differ; a don't care character in P matches any character and therefore never contributes a mismatch. The paper proposes an algorithm whose running time depends on the number q of "islands" in the pattern, where an island is a maximal-length substring of P that contains no don't care characters. The algorithm runs in \(O(n\sqrt{k\log m} + n\min\{\sqrt[3]{qk\log^2 m}, \sqrt{q\log m}\})\) time. If q is \(O(k)\), the second term is at most \(n\sqrt{q\log m} = O(n\sqrt{k\log m})\), so the running time becomes \(O(n\sqrt{k\log m})\), which essentially matches the best known running time for the problem without don't cares. If q is \(O(k^2)\), the algorithm is asymptotically faster than the previous best deterministic algorithm for patterns with don't cares, which runs in \(O(nk\,\text{polylog}\, m)\) time.
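To make the problem statement concrete, the following is a minimal Python sketch of a brute-force reference check. It is not the paper's algorithm: it runs in O(nm) time and only illustrates the definitions of a match with at most k mismatches and of an island. The wildcard symbol `*` and the function names `count_islands` and `k_mismatch_naive` are assumptions made for illustration.

```python
# Assumption: '*' stands for the don't care (wildcard) character.
DONT_CARE = "*"


def count_islands(pattern, wildcard=DONT_CARE):
    """Count maximal runs of non-wildcard characters in the pattern (q)."""
    islands = 0
    in_island = False
    for ch in pattern:
        if ch == wildcard:
            in_island = False
        elif not in_island:
            # A non-wildcard character right after a wildcard (or at the
            # start of the pattern) opens a new island.
            islands += 1
            in_island = True
    return islands


def k_mismatch_naive(text, pattern, k, wildcard=DONT_CARE):
    """Return all 0-based positions i where pattern matches text[i:i+m]
    with at most k mismatches. Wildcards in the pattern never mismatch.
    Brute force, O(n * m) time; a reference check only."""
    n, m = len(text), len(pattern)
    hits = []
    for i in range(n - m + 1):
        mismatches = 0
        for j in range(m):
            if pattern[j] != wildcard and pattern[j] != text[i + j]:
                mismatches += 1
                if mismatches > k:
                    break  # this alignment already exceeds the threshold
        if mismatches <= k:
            hits.append(i)
    return hits


if __name__ == "__main__":
    print(count_islands("ab*c**de"))           # islands: "ab", "c", "de"
    print(k_mismatch_naive("abcabd", "a*d", 1))
```

A brute-force checker like this is mainly useful for validating a faster implementation on small inputs, since its output follows the definition directly.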