An Improved Sketching Algorithm for Edit Distance

Ce Jin,Jelani Nelson,Kewen Wu
DOI: https://doi.org/10.4230/LIPIcs.STACS.2021.45
2021-05-02
Abstract: We provide improved upper bounds for the simultaneous sketching complexity of edit distance. Consider two parties, Alice with input $x\in\Sigma^n$ and Bob with input $y\in\Sigma^n$, that share public randomness and are given a promise that the edit distance $\mathsf{ed}(x,y)$ between their two strings is at most some given value $k$. Alice must send a message $s_x$ and Bob must send $s_y$ to a third party Charlie, who does not know the inputs but shares the same public randomness and also knows $k$. Charlie must output $\mathsf{ed}(x,y)$ precisely as well as a sequence of $\mathsf{ed}(x,y)$ edits required to transform $x$ into $y$. The goal is to minimize the lengths $|s_x|, |s_y|$ of the messages sent. The protocol of Belazzougui and Zhang (FOCS 2016), building upon the random walk method of Chakraborty, Goldenberg, and Koucký (STOC 2016), achieves a maximum message length of $\tilde O(k^8)$ bits, where $\tilde O(\cdot)$ hides $\mathrm{poly}(\log n)$ factors. In this work we build upon Belazzougui and Zhang's protocol and provide an improved analysis demonstrating that a slight modification of their construction achieves a bound of $\tilde O(k^3)$.
Data Structures and Algorithms
What problem does this paper attempt to address?
The paper addresses the following problem: when the edit distance between two strings is at most a given value \(k\), how can two parties (Alice and Bob) each send as short a message as possible to a third party (Charlie) so that Charlie can compute the edit distance exactly and output an optimal edit sequence that converts one string into the other?

### Problem Background

The edit distance measures the difference between two strings. It is defined as the minimum number of insertions, deletions, or substitutions required to convert one string into the other. It is an important metric in applications such as information retrieval, natural language processing, and bioinformatics. The classical Wagner-Fischer algorithm computes the edit distance between two strings of length \(n\) in \(O(n^2)\) time, which is costly for long strings.

### Problem Description

Specifically, the paper considers the following setting:

- Alice has a string \(x\) of length \(n\), and Bob has a string \(y\) of length \(n\).
- They share common randomness, and the edit distance \(\text{ed}(x, y)\) between \(x\) and \(y\) is promised to be at most a given value \(k\).
- Alice sends a message \(s_x\) and Bob sends a message \(s_y\) to Charlie. Charlie does not know the input strings but shares the same common randomness and knows \(k\).
- Charlie must output \(\text{ed}(x, y)\) exactly, together with an optimal edit sequence that converts \(x\) into \(y\).
- The goal is to minimize the message lengths \(|s_x|\) and \(|s_y|\).

### Previous Work

The earlier protocol of Belazzougui and Zhang has each party send a message of \(\tilde{O}(k^8)\) bits. Here, \(\tilde{O}(\cdot)\) hides factors polylogarithmic in \(n\).
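
For reference, the quadratic-time dynamic program mentioned in the problem background can be sketched as follows. It also produces the kind of output Charlie must return: the distance plus one optimal edit sequence. This is a minimal illustration that assumes both full strings are available locally, so it is not the sketching protocol itself. The function name `edit_script` and the tuple format of the edits are hypothetical choices made here.

```python
def edit_script(x: str, y: str):
    """Wagner-Fischer dynamic program with backtracking.

    Returns (distance, edits). Each edit is a tuple (op, pos, char), where
    pos is a 0-based index into x. The tuple layout is an illustrative choice.
    """
    n, m = len(x), len(y)
    # d[i][j] = edit distance between x[:i] and y[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # delete all i characters of x[:i]
    for j in range(m + 1):
        d[0][j] = j  # insert all j characters of y[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + 1,                           # delete x[i-1]
                d[i][j - 1] + 1,                           # insert y[j-1]
                d[i - 1][j - 1] + (x[i - 1] != y[j - 1]),  # match or substitute
            )

    # Walk back from d[n][m] to d[0][0] to recover one optimal edit sequence.
    edits = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (x[i - 1] != y[j - 1]):
            if x[i - 1] != y[j - 1]:
                edits.append(("substitute", i - 1, y[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            edits.append(("delete", i - 1, x[i - 1]))
            i -= 1
        else:
            # Insert y[j-1] immediately before position i of x.
            edits.append(("insert", i, y[j - 1]))
            j -= 1
    edits.reverse()
    return d[n][m], edits


distance, edits = edit_script("kitten", "sitting")
print(distance)  # 3
```

The table has \((n+1)^2\) entries for two strings of length \(n\), which is where the \(O(n^2)\) cost comes from. The sketching setting asks for far less communication than sending either string in full.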

### Main Contributions of the Paper

By refining previous techniques, in particular the analysis of the CGK random walk method, the paper proves that a slightly modified protocol reduces the message length to \(\tilde{O}(k^3)\) bits. More specifically, the message length is bounded by \(O(k^3\log^2(n / \delta)\log n)\) bits, where \(\delta\) is the failure probability.

### Main Technical Contributions

1. **Reducing the Number of Random Walks**: The paper reduces the number of random walks required by improving the analysis of the CGK random walk. Specifically, it proves that under certain conditions, the probability that two events occur simultaneously (a certain edge is not in the optimal matching, and the number of progress steps of the random walk is small) is high. This avoids the large \(k\)-dependent factor that the union bound incurs in previous work.
2. **Improved Analysis of the CGK Random Walk**: The paper tightens the upper bound on the number of progress steps through a more detailed analysis of the CGK random walk, further reducing the required message length.

### Conclusion

By improving existing techniques, the paper significantly reduces the message length in the edit-distance communication problem, providing new ideas and methods for efficiently processing large-scale string data.
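
To make the CGK random walk referred to above more concrete, here is a minimal sketch of the embedding as it is usually described. A pointer walks through the input; at each of \(3n\) steps the current symbol is copied to the output, and a shared random bit, chosen as a function of the step index and that symbol, decides whether the pointer advances. The names `cgk_walk` and `shared_randomness`, the pad symbol, and the seeded `random.Random` standing in for public randomness are assumptions made for illustration. The sketch shows only the basic walk, not Belazzougui and Zhang's protocol or the modification analyzed in this paper.

```python
import random


def shared_randomness(alphabet: str, n: int, seed: int):
    """Hypothetical stand-in for public randomness: for each of the 3n steps,
    one random bit per alphabet symbol."""
    rng = random.Random(seed)
    return [{c: rng.randrange(2) for c in alphabet} for _ in range(3 * n)]


def cgk_walk(s: str, shared_bits, pad: str = "#") -> str:
    """Embed s into a string of length len(shared_bits).

    At each step the current symbol is emitted, and the shared bit for
    (step, symbol) decides whether the pointer stays or advances by one.
    Once the pointer runs past the end of s, the pad symbol is emitted.
    """
    out = []
    i = 0  # pointer into s
    for step_bits in shared_bits:
        if i < len(s):
            out.append(s[i])
            i += step_bits[s[i]]  # advance by 0 or 1
        else:
            out.append(pad)
    return "".join(out)


# Alice and Bob run the same walk on their own strings with the same bits.
bits = shared_randomness("ab", n=6, seed=2021)
emb_x = cgk_walk("abbaba", bits)
emb_y = cgk_walk("ababba", bits)
hamming = sum(a != b for a, b in zip(emb_x, emb_y))
```

Because both parties use the same `shared_bits`, their pointers move in lockstep at every step where they read the same symbol. The offset between the two pointers can change only at steps where the symbols differ, which is what makes it behave like a random walk.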