Locally consistent decomposition of strings with applications to edit distance sketching

Sudatta Bhattacharya, Michal Koucký
2023-11-27
Abstract: In this paper we provide a new locally consistent decomposition of strings. Each string $x$ is decomposed into blocks that can be described by grammars of size $\widetilde{O}(k)$ (using some amount of randomness). If we take two strings $x$ and $y$ of edit distance at most $k$, then their block decompositions use the same number of grammars, and the $i$-th grammar of $x$ is the same as the $i$-th grammar of $y$ except for at most $k$ indices $i$. The edit distance of $x$ and $y$ equals the sum of the edit distances of the pairs of blocks where $x$ and $y$ differ. Our decomposition can be used to design a sketch of size $\widetilde{O}(k^2)$ for edit distance, and also a rolling sketch for edit distance of size $\widetilde{O}(k^2)$. The rolling sketch allows updating the sketched string by appending a symbol or removing a symbol from the beginning of the string.
Data Structures and Algorithms
What problem does this paper attempt to address?
The main problem this paper attempts to address is the design of small and efficient sketches for edit distance, which in turn support its fast computation and approximation. Specifically:

1. **Alignment Challenge**: Finding the optimal alignment is the core difficulty in computing the edit distance between two strings. It is harder still in the sketching setting, because each sketch is computed from one string alone, without access to the other.
2. **New Decomposition Method**: The paper proposes a new locally consistent decomposition that breaks each string into blocks, each of which can be described by a grammar of size \(\widetilde{O}(k)\) (using some amount of randomness). If the edit distance between two strings \(x\) and \(y\) is at most \(k\), their decompositions use the same number of grammars, and the \(i\)-th grammars of \(x\) and \(y\) are identical for all but at most \(k\) indices \(i\). The edit distance of \(x\) and \(y\) then equals the sum of the edit distances of the block pairs that differ.
3. **Sketch Design**: Based on this decomposition, the paper designs an edit distance sketch of size \(\widetilde{O}(k^2)\) and a rolling sketch of the same size. The rolling sketch allows the sketched string to be updated by appending a symbol or by removing a symbol from its beginning.
4. **Optimization and Application**: The decomposition is not only suitable for the approximate computation of edit distance; it can also be used to embed edit distance into Hamming distance. Compared to previous methods, it is more amenable to parallelization, and it is expected to find further applications.
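The idea of local consistency can be illustrated with a toy splitter. The sketch below is not the paper's decomposition and produces no grammars; it is a simple content-defined chunking scheme, chosen here only because it shares the key property that a block boundary depends on nearby symbols alone. The function names, the window length, the modulus, and the hash constants are all assumptions made for illustration.

```python
def split_blocks(s: str, window: int = 4, modulus: int = 8) -> list[str]:
    """Cut s after position i whenever a hash of the last `window` symbols is 0 mod `modulus`.

    Hypothetical toy splitter: each cut depends only on the `window` symbols
    ending at i, so an edit can change cut decisions only near the edit.
    """
    blocks, start = [], 0
    for i in range(window - 1, len(s)):
        h = 0
        for c in s[i - window + 1 : i + 1]:
            # Fixed polynomial hash, so results are reproducible across runs.
            h = (h * 131 + ord(c)) % 1_000_003
        if h % modulus == 0:
            blocks.append(s[start : i + 1])
            start = i + 1
    if start < len(s):
        blocks.append(s[start:])  # trailing block that no cut closed
    return blocks


def shared_ends(a: list[str], b: list[str]) -> tuple[int, int]:
    """Count equal blocks at the start and at the end of two block lists."""
    limit = min(len(a), len(b))
    pre = 0
    while pre < limit and a[pre] == b[pre]:
        pre += 1
    suf = 0
    while suf < limit - pre and a[-1 - suf] == b[-1 - suf]:
        suf += 1
    return pre, suf


x = "the quick brown fox jumps over the lazy dog " * 3
y = x[:60] + "Z" + x[60:]  # one insertion, so the edit distance is 1
bx, by = split_blocks(x), split_blocks(y)
pre, suf = shared_ends(bx, by)
print(len(bx), len(by), "blocks; shared prefix/suffix blocks:", pre, suf)
```

Blocks that end before the inserted symbol are cut identically in both strings, and blocks that start after the first cut past the affected windows reappear unchanged, shifted by one position. Unlike the paper's decomposition, this toy version gives no guarantee that the two strings yield the same number of blocks or that at most \(k\) of them differ.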