Faster two-dimensional pattern matching with $k$ mismatches

Jonas Ellert, Paweł Gawrychowski, Adam Górkiewicz, Tatiana Starikovskaya
2024-10-29
Abstract: The classical pattern matching problem asks for locating all occurrences of one string, called the pattern, in another, called the text, where a string is simply a sequence of characters. Due to the potential practical applications, it is desirable to seek approximate occurrences, for example by bounding the number of mismatches. This problem has been extensively studied, and by now we have a good understanding of the best possible time complexity as a function of $n$ (length of the text), $m$ (length of the pattern), and $k$ (number of mismatches). In particular, we know that for $k=\mathcal{O}(\sqrt{m})$, we can achieve quasi-linear time complexity [Gawrychowski and Uznański, ICALP 2018]. We consider a natural generalisation of the approximate pattern matching problem to two-dimensional strings, which are simply square arrays of characters. The exact version of this problem was extensively studied in the early 90s. While periodicity, which is the basic tool for one-dimensional pattern matching, admits a natural extension to two dimensions, it turns out to be significantly more challenging to work with, and it took some time until an alphabet-independent linear-time algorithm was obtained by Galil and Park [SICOMP 1996]. In approximate two-dimensional pattern matching, we are given a pattern of size $m\times m$ and a text of size $n\times n$, and ask for all locations in the text where the pattern matches with at most $k$ mismatches. The asymptotically fastest algorithm for this problem works in $\mathcal{O}(kn^{2})$ time [Amir and Landau, TCS 1991]. We provide a new insight into two-dimensional periodicity to improve on these 30-year-old bounds. Our algorithm works in $\tilde{\mathcal{O}}((m^{2}+mk^{5/4})n^{2}/m^{2})$ time, which is $\tilde{\mathcal{O}}(n^{2})$ for $k=\mathcal{O}(m^{4/5})$.
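As a quick check of the final claim, the stated bound can be expanded term by term, using only the quantities defined in the abstract:

$$
\tilde{\mathcal{O}}\!\left(\frac{(m^{2}+mk^{5/4})\,n^{2}}{m^{2}}\right)
= \tilde{\mathcal{O}}\!\left(n^{2}+\frac{k^{5/4}}{m}\,n^{2}\right).
$$

If $k=\mathcal{O}(m^{4/5})$, then $k^{5/4}=\mathcal{O}(m)$, so the second term is $\mathcal{O}(n^{2})$ and the whole expression is $\tilde{\mathcal{O}}(n^{2})$, that is, quasi-linear in the size $n^{2}$ of the text.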
Data Structures and Algorithms
What problem does this paper attempt to address?
This paper addresses pattern matching with at most \(k\) mismatches in two-dimensional strings. Specifically, given a pattern of size \(m\times m\) and a text of size \(n\times n\), the goal is to find all positions in the text where the pattern matches with at most \(k\) mismatches.

### Background and Problem Definition

**One-Dimensional Pattern Matching**:

- The classical one-dimensional pattern-matching problem asks for all occurrences of one string (the pattern) in another string (the text).
- For practical applications, it is often necessary to seek approximate matches, for example by bounding the number of mismatches.
- For \(k = O(\sqrt{m})\), algorithms with quasi-linear time complexity are known (Gawrychowski and Uznański, ICALP 2018).

**Two-Dimensional Pattern Matching**:

- The exact version of the two-dimensional pattern-matching problem was widely studied in the early 1990s, mainly with applications in image processing.
- Periodicity, the basic tool in one dimension, extends naturally to two dimensions but is significantly harder to work with there.
- For two-dimensional approximate pattern matching, the fastest previously known algorithm runs in \(O(kn^{2})\) time (Amir and Landau, TCS 1991).

### Main Contributions of the Paper

The paper provides a new insight into two-dimensional periodicity to improve on these 30-year-old bounds. Specifically, it proposes a new algorithm that runs in \(\tilde{O}\left(\frac{(m^{2}+mk^{5/4})n^{2}}{m^{2}}\right)\) time, which is \(\tilde{O}(n^{2})\) for \(k = O(m^{4/5})\).

### Technical Overview

1. **Generalization from One-Dimensional to Two-Dimensional**:
   - Using techniques such as Karloff's algorithm and the Fast Fourier Transform (FFT), generalize one-dimensional pattern-matching algorithms to two dimensions.
   - Linearize two-dimensional strings into one-dimensional strings so that existing efficient algorithms can be applied.
2. **Handling of Two-Dimensional Periodicity**:
   - Define two-dimensional approximate periods and prove that the difference \(u - v\) of any two elements \(u\) and \(v\) of the given set \(Q\) is an \(O(k)\)-period of the pattern \(P\).
   - Use geometric properties and Dilworth's theorem to find two \(O(k)\)-periods \(\phi\) and \(\psi\) with a large angle between them.
3. **Partitioning of Pattern and Text**:
   - Define tiles and truncated tiles, and use these concepts to partition the pattern and the text into single-character strings.
   - After pre-processing with respect to \(\phi\) and \(\psi\), a truncated tile string \(R\) can be partitioned into \(O(k)\) monochromatic truncated sub-tile strings in \(\tilde{O}(|\text{dom}(R)|+k)\) time.

### Conclusion

By introducing a new method for handling two-dimensional periodicity, the paper improves on the time complexity of existing two-dimensional approximate pattern-matching algorithms. In particular, it achieves \(\tilde{O}(n^{2})\) time, quasi-linear in the size of the text, whenever \(k = O(m^{4/5})\). The result is relevant to applications such as image processing.
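To make the FFT building block mentioned in the technical overview concrete, here is a minimal Python sketch of the classical per-character convolution approach to counting mismatches in two dimensions. It is not the paper's algorithm: it takes roughly \(O(\sigma n^{2}\log n)\) time, where \(\sigma\) is the number of distinct characters in the pattern, and it uses none of the periodicity machinery. The function names `mismatch_counts` and `k_mismatch_occurrences` are hypothetical, chosen only for this illustration.

```python
import numpy as np


def mismatch_counts(text, pattern):
    """Return D where D[i, j] is the number of mismatches between `pattern`
    and the window of `text` whose top-left corner is (i, j).

    Baseline sketch only: one FFT-based cross-correlation per distinct
    pattern character (an assumption of this example, not the paper's method).
    """
    text = np.asarray(text)
    pattern = np.asarray(pattern)
    n_r, n_c = text.shape
    m_r, m_c = pattern.shape
    # Pad to the full linear-convolution size to avoid circular wrap-around.
    size = (n_r + m_r - 1, n_c + m_c - 1)
    matches = np.zeros((n_r - m_r + 1, n_c - m_c + 1))
    for c in np.unique(pattern):
        t_ind = (text == c).astype(float)
        # Flipping the pattern indicator turns convolution into correlation.
        p_ind = (pattern == c).astype(float)[::-1, ::-1]
        prod = np.fft.rfft2(t_ind, s=size) * np.fft.rfft2(p_ind, s=size)
        conv = np.fft.irfft2(prod, s=size)
        # Entry (i + m_r - 1, j + m_c - 1) counts the matches of c at (i, j).
        matches += conv[m_r - 1:n_r, m_c - 1:n_c]
    return pattern.size - np.rint(matches).astype(int)


def k_mismatch_occurrences(text, pattern, k):
    """List all top-left corners where the pattern has at most k mismatches."""
    d = mismatch_counts(text, pattern)
    return [tuple(map(int, pos)) for pos in np.argwhere(d <= k)]


if __name__ == "__main__":
    text = np.array([list("abab"), list("baba"), list("abab")])
    pattern = np.array([list("ab"), list("ba")])
    print(mismatch_counts(text, pattern))
    print(k_mismatch_occurrences(text, pattern, 0))
```

On the checkerboard example above, the pattern occurs exactly at the corners (0, 0), (0, 2) and (1, 1), and every other alignment has 4 mismatches. The per-character loop is what makes this baseline depend on the alphabet size, unlike the bound stated in the paper.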