Pattern Matching with Mismatches and Wildcards

Gabriel Bathie, Panagiotis Charalampopoulos, Tatiana Starikovskaya
2024-05-22
Abstract: In this work, we address the problem of approximate pattern matching with wildcards. Given a pattern $P$ of length $m$ containing $D$ wildcards, a text $T$ of length $n$, and an integer $k$, our objective is to identify all fragments of $T$ within Hamming distance $k$ from $P$.
Data Structures and Algorithms
What problem does this paper attempt to address?
This paper addresses approximate pattern matching with wildcards. Specifically, given a pattern \(P\) of length \(m\) that contains \(D\) wildcards, a text \(T\) of length \(n\), and an integer \(k\), the goal is to identify all fragments of \(T\) whose Hamming distance from \(P\) is at most \(k\).

### Main contributions of the paper

1. **Algorithm complexity**:
   - The paper proposes an algorithm with running time \(O(n+(D + k)(G + k)\cdot n/m)\), where \(G\leq D\) is the number of maximal wildcard fragments (maximal runs of consecutive wildcards) in the pattern \(P\).
   - When \(D\), \(G\) and \(k\) are small relative to \(n\), this algorithm is faster than existing methods. For example, when \(m = n/2\), \(k = G=n^{2/5}\), and \(D = n^{3/5}\), the second term is \(O(n^{3/5}\cdot n^{2/5}) = O(n)\), so the algorithm runs in \(O(n)\) time, while previous methods require \(\Omega(n^{6/5})\) time.
2. **Exact pattern matching**:
   - For exact pattern matching (i.e., \(k = 0\)), the paper proposes a simpler algorithm with running time \(O(n+DG\cdot n/m)\).
   - This algorithm is faster than the existing FFT-based algorithm, which runs in \(O(n\log m)\) time, whenever \(DG=o(m\log m)\).
3. **Structural features**:
   - The paper characterizes the structure of \(k\)-mismatch occurrences and proves that, in a text of length \(O(m)\), these occurrences can be divided into \(O((D + k)(G + k))\) arithmetic progressions.
   - It constructs an infinite family of examples with \(\Omega((D + k)k)\) arithmetic progressions of occurrences, using combinatorial results on sets without arithmetic progressions.

### Technical overview

- **Sparsification technique**:
  - The paper introduces "sparsifiers": selected positions in the pattern \(P\) that do not belong to any fragment with a wildcard density much greater than \(D/m\).
  - These positions are used as anchors to find approximate matches.
- **Periodic structure**:
  - The algorithm exploits the periodic structure of strings and processes periodic fragments with a sliding-window method.
  - For non-periodic fragments, it computes a maximal fragment \(S'\) to align with the periodic fragments of the text \(T\), ensuring that the aligned positions do not coincide with wildcards.
- **PILLAR model**:
  - The algorithm is expressed in the PILLAR model, which abstracts basic string primitives such as longest common prefix queries and internal pattern matching queries.
  - Efficient implementations of these primitives yield meta-algorithms that apply in multiple settings (the standard word RAM model, as well as compressed, dynamic, and quantum settings).

### Conclusion

This paper provides an efficient algorithm for approximate pattern matching with wildcards, based on the sparsification technique and an analysis of periodic structure. The algorithm is faster than existing methods within certain parameter ranges and also improves on them for exact pattern matching.
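
To make the problem statement and the parameters \(D\) and \(G\) concrete, here is a minimal brute-force sketch in Python. It is only a reference baseline running in \(O(nm)\) time, not the paper's algorithm. The wildcard symbol `?` and all function names are assumptions chosen for illustration.

```python
from typing import List

WILDCARD = "?"  # assumed wildcard symbol; the paper does not fix a concrete character


def count_wildcards(pattern: str) -> int:
    """D: the total number of wildcards in the pattern."""
    return pattern.count(WILDCARD)


def count_wildcard_fragments(pattern: str) -> int:
    """G: the number of maximal runs of consecutive wildcards in the pattern."""
    return sum(
        1
        for i, c in enumerate(pattern)
        if c == WILDCARD and (i == 0 or pattern[i - 1] != WILDCARD)
    )


def mismatches(pattern: str, window: str) -> int:
    """Hamming distance in which a wildcard in the pattern matches any character."""
    return sum(1 for p, t in zip(pattern, window) if p != WILDCARD and p != t)


def k_mismatch_occurrences(pattern: str, text: str, k: int) -> List[int]:
    """Start positions i such that text[i:i+m] is within distance k of the pattern."""
    m, n = len(pattern), len(text)
    return [
        i
        for i in range(n - m + 1)
        if mismatches(pattern, text[i:i + m]) <= k
    ]
```

For instance, the pattern `a??b?` has \(D = 3\) wildcards grouped into \(G = 2\) maximal fragments, and `k_mismatch_occurrences("a?b", "aabacbxxb", 0)` reports the start positions of the exact occurrences. A baseline like this can serve as a correctness check when experimenting with faster methods.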