Faster Sublinear-Time Edit Distance

Karl Bringmann, Alejandro Cassis, Nick Fischer, Tomasz Kociumaka
2023-12-04
Abstract: We study the fundamental problem of approximating the edit distance of two strings. After an extensive line of research led to the development of a constant-factor approximation algorithm in almost-linear time, recent years have witnessed a notable shift in focus towards sublinear-time algorithms. Here, the task is typically formalized as the $(k, K)$-gap edit distance problem: Distinguish whether the edit distance of two strings is at most $k$ or more than $K$. Surprisingly, it is still possible to compute meaningful approximations in this challenging regime. Nevertheless, in almost all previous work, truly sublinear running time of $O(n^{1-\varepsilon})$ (for a constant $\varepsilon > 0$) comes at the price of at least polynomial gap $K \ge k \cdot n^{\Omega(\varepsilon)}$. Only recently, [Bringmann, Cassis, Fischer, and Nakos; STOC'22] broke through this barrier and solved the sub-polynomial $(k, k^{1+o(1)})$-gap edit distance problem in time $O(n/k + k^{4+o(1)})$, which is truly sublinear if $n^{\Omega(1)} \le k \le n^{\frac14-\Omega(1)}$. The $n/k$ term is inevitable (already for Hamming distance), but it remains an important task to optimize the $\mathrm{poly}(k)$ term and, in general, solve the $(k, k^{1+o(1)})$-gap edit distance problem in sublinear time for larger values of $k$. In this work, we design an improved algorithm for the $(k, k^{1+o(1)})$-gap edit distance problem in sublinear time $O(n/k + k^{2+o(1)})$, yielding a significant quadratic speed-up over the previous $O(n/k + k^{4+o(1)})$-time algorithm. Notably, our algorithm is unconditionally almost-optimal (up to subpolynomial factors) in the regime where $k \leq n^{\frac13}$ and improves upon the state of the art for $k \leq n^{\frac12-o(1)}$.
Data Structures and Algorithms
What problem does this paper attempt to address?
This paper addresses the approximate computation of string edit distance, with a focus on sublinear-time algorithms. Given two strings \(X\) and \(Y\), the task is to distinguish whether their edit distance is at most \(k\) or greater than \(K\); this is the \((k, K)\)-gap edit distance problem. The edit distance is the minimum number of character insertions, deletions, or substitutions required to transform one string into the other.

### Core Problems of the Paper

1. **\((k, k^{1 + o(1)})\)-gap Edit Distance Problem**:
   - Can the \((k, k^{1+o(1)})\)-gap edit distance problem be solved in truly sublinear time?
   - Specifically, when \(k \geq n^{\Omega(1)}\), does there exist an algorithm with running time \(n^{1-\Omega(1)}\)?
2. **\((k, O(k))\)-gap Edit Distance Problem**:
   - Can the \((k, O(k))\)-gap edit distance problem be solved in \(O(n/k+\text{poly}(k))\) time?

### Existing Research Background

- **Early Research**: Early research mainly focused on polynomial-time algorithms for approximating the edit distance, such as the almost-linear-time algorithm proposed by Andoni and Onak.
- **Recent Progress**: In recent years, the research focus has shifted to sublinear-time algorithms. For example, Bringmann et al. presented an algorithm at STOC 2022 that solves the \((k, k^{1+o(1)})\)-gap edit distance problem in \(O(n/k + k^{4+o(1)})\) time, which is truly sublinear only for \(n^{\Omega(1)} \leq k \leq n^{1/4-\Omega(1)}\).

### Main Contributions of This Paper

1. **Improved Algorithm**:
   - This paper proposes a new algorithm that solves the \((k, k^{1+o(1)})\)-gap edit distance problem in \(O(n/k + k^{2+o(1)})\) time, reducing the \(\text{poly}(k)\) term of the previous \(O(n/k + k^{4+o(1)})\) bound from \(k^{4+o(1)}\) to \(k^{2+o(1)}\).
   - This improvement makes the algorithm truly sublinear in the range \(n^{\varepsilon}\leq k\leq n^{1/2-\varepsilon}\), expanding the regime in which sublinear-time algorithms apply.
2. **Block-Periodicity Decomposition**:
   - This paper introduces the concept of "block periodicity" and uses "breaks" to optimize the algorithm.
   - By detecting breaks in the string, the problem can be divided into multiple subproblems, each with a smaller block periodicity, which simplifies their handling.
3. **Multi-Precision Sampling Technique**:
   - This paper also uses the multi-precision sampling technique, which reduces the number of subproblems in the recursion and further improves the efficiency of the algorithm.

### Technical Overview

1. **Block Periodicity**:
   - The paper defines the block periodicity \(BP_p(X)\) of a string \(X\), which captures how the string can be divided into several \(p\)-periodic substrings.
   - The block periodicity of a string can be estimated by detecting breaks in the string.
2. **Use of Breaks**:
   - A position \(i\) is a \(p\)-break if the substring \(X[i..i + 3p)\) is not \(p\)-periodic.
   - Breaks can be used to divide the string into multiple substrings, each with a smaller block periodicity, which simplifies their handling.
3. **Algorithm Framework**:
   - An efficient recursive algorithm is designed using the Andoni-Krauthgamer-Onak framework combined with the multi-precision sampling technique.
   - Within this framework, the estimation of block periodicity and the detection of breaks are optimized.
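
The break definition in the technical overview can be illustrated with a brute-force sketch. It is not the paper's procedure, which has to run in sublinear time and therefore cannot scan every window. The helper names `is_p_periodic` and `p_breaks` are hypothetical, and the code assumes that "\(p\)-periodic" means the window has some period \(q \leq p\), i.e. `s[j] == s[j + q]` for all valid `j`. It also assumes that only full windows of length \(3p\) are considered.

```python
from typing import List


def is_p_periodic(s: str, p: int) -> bool:
    """Assumed definition: s has some period q with 1 <= q <= p."""
    return any(
        all(s[j] == s[j + q] for j in range(len(s) - q))
        for q in range(1, p + 1)
    )


def p_breaks(x: str, p: int) -> List[int]:
    """Return every position i whose window x[i : i + 3p] is not p-periodic.

    Brute force: checks all full windows, so it takes far more than
    sublinear time. It only illustrates the definition.
    """
    window = 3 * p
    return [
        i
        for i in range(len(x) - window + 1)
        if not is_p_periodic(x[i:i + window], p)
    ]


if __name__ == "__main__":
    # A single foreign character inside an otherwise 2-periodic string
    # makes every full window that contains it a 2-break.
    print(p_breaks("abababXababab", 2))
```

Under these assumptions, a string without breaks is locally periodic everywhere, while each break marks a region where the periodic structure fails. The summary describes this as the signal used to split the input into simpler pieces.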
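
For contrast with the sublinear-time goal, the \((k, K)\)-gap problem defined at the start of this summary can always be answered by an exact threshold test, though not in sublinear time. The sketch below is a standard banded dynamic program that decides whether the edit distance is at most \(k\) in roughly \(O(nk)\) time. It is a baseline for illustration, not the algorithm of the paper, and the function name `edit_distance_at_most` is hypothetical.

```python
def edit_distance_at_most(x: str, y: str, k: int) -> bool:
    """Return True iff the edit distance of x and y is at most k.

    Only DP cells within k of the main diagonal are filled, because any
    alignment that leaves this band already costs more than k edits.
    """
    n, m = len(x), len(y)
    if abs(n - m) > k:
        return False
    inf = k + 1  # every value above k is capped to "too large"
    # prev[j]: capped distance between the empty prefix of x and y[:j]
    prev = [j if j <= k else inf for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [inf] * (m + 1)
        lo, hi = max(0, i - k), min(m, i + k)
        if lo == 0:
            cur[0] = i  # delete the first i characters of x (i <= k here)
        for j in range(max(1, lo), hi + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            cur[j] = min(
                prev[j - 1] + cost,  # match or substitute
                prev[j] + 1,         # delete x[i - 1]
                cur[j - 1] + 1,      # insert y[j - 1]
                inf,
            )
        prev = cur
    return prev[m] <= k


if __name__ == "__main__":
    print(edit_distance_at_most("kitten", "sitting", 3))
    print(edit_distance_at_most("kitten", "sitting", 2))
```

An exact test like this answers "close" precisely when the distance is at most \(k\), so it is a valid answer to the gap problem for any \(K \geq k\). The gap formulation relaxes the requirement: for inputs with distance between \(k\) and \(K\) either answer is acceptable, and that slack is what sublinear-time algorithms exploit.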