Lempel-Ziv (LZ77) Factorization in Sublinear Time

Dominik Kempa, Tomasz Kociumaka
2024-09-19
Abstract: Lempel-Ziv (LZ77) factorization is a fundamental problem in string processing: Greedily partition a given string $T$ from left to right into blocks (called phrases) so that each phrase is either the leftmost occurrence of a letter or the longest prefix of the unprocessed suffix that has another occurrence earlier in $T$. Due to numerous applications, LZ77 factorization is one of the most studied problems on strings. In the 47 years since its inception, several algorithms were developed for different models of computation, including parallel, GPU, external-memory, and quantum. Remarkably, however, the complexity of the most basic variant is still not settled: All existing algorithms in the RAM model run in $\Omega(n)$ time, which is a $\Theta(\log n)$ factor away from the lower bound of $\Omega(n/\log n)$ (following from the necessity to read the input, which takes $\Theta(n/\log n)$ space for $T\in\{0,1\}^{n}$). We present the first $o(n)$-time algorithm for LZ77 factorization, breaking the linear-time barrier present for nearly 50 years. For $T\in\{0,1\}^{n}$, our algorithm runs in $\mathcal{O}(n/\sqrt{\log n})=o(n)$ time and uses the optimal $\mathcal{O}(n/\log n)$ working space. Our algorithm generalizes to $\Sigma=[0..\sigma)$, where $\sigma=n^{\mathcal{O}(1)}$. The runtime and working space then become $\mathcal{O}((n\log\sigma)/\sqrt{\log n})$ and $\mathcal{O}(n/\log_{\sigma} n)$. To obtain our algorithm, we prove a more general result: For any constant $\epsilon\in(0,1)$ and $T\in[0..\sigma)^{n}$, in $\mathcal{O}((n\log\sigma)/\sqrt{\log n})$ time and using $\mathcal{O}(n/\log_{\sigma}n)$ space, we can construct an $\mathcal{O}(n/\log_{\sigma}n)$-size index that, given any $P=T[j..j+\ell)$ (represented as $(j,\ell)$), computes the leftmost occurrence of $P$ in $T$ in $\mathcal{O}(\log^{\epsilon}n)$ time. In other words, we solve the indexing/online variant of the LZ77 problem.
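The gaps quoted in the abstract can be checked with a short calculation. It assumes the standard word RAM with machine words of $w=\Theta(\log n)$ bits and an input packed into consecutive words; these model details are assumptions spelled out here, not stated explicitly in the abstract.

$$
\begin{aligned}
\text{input size in words} &= \frac{n\log\sigma}{w} = \Theta\!\left(\frac{n\log\sigma}{\log n}\right) = \Theta\!\left(\frac{n}{\log_{\sigma} n}\right),\\
\frac{\text{linear time}}{\text{input-reading bound}} &= \frac{n}{n/\log n} = \log n \qquad (\sigma = 2),\\
\frac{\text{new running time}}{\text{input-reading bound}} &= \frac{(n\log\sigma)/\sqrt{\log n}}{(n\log\sigma)/\log n} = \sqrt{\log n}.
\end{aligned}
$$

Under these assumptions, the distance to the input-reading bound shrinks from $\Theta(\log n)$ to $\mathcal{O}(\sqrt{\log n})$.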
Data Structures and Algorithms
### What problem does this paper attempt to address?

This paper addresses the time complexity of Lempel-Ziv (LZ77) factorization, specifically whether the factorization can be computed in sublinear time.

1. **Background and problem definition**:
   - LZ77 factorization is a fundamental problem in string processing. It greedily partitions a given string from left to right into blocks (called phrases). Each phrase is either the first occurrence of a letter or the longest prefix of the unprocessed suffix that has another occurrence earlier in the text.
   - LZ77 factorization is widely used in data compression (for example, in the zip, pdf, and png formats), repeated pattern detection, and compressed indexing.
2. **Limitations of existing algorithms**:
   - Although LZ77 factorization has been studied intensively for 47 years and many efficient algorithms have been developed, all existing RAM-model algorithms run in \( \Omega(n) \) time.
   - In other words, every existing algorithm takes at least linear time, which is a \( \Theta(\log n) \) factor above the \( \Omega(n/\log n) \) lower bound that follows from the need to read the input.
3. **Main contributions of the paper**:
   - The authors present the first algorithm that computes the LZ77 factorization in \( o(n) \) time, breaking the linear-time barrier that has stood for nearly 50 years.
   - For the binary alphabet (\( \sigma = 2 \)), the algorithm runs in \( O\left(\frac{n}{\sqrt{\log n}}\right) \) time and uses the optimal \( O\left(\frac{n}{\log n}\right) \) working space.
   - For a larger integer alphabet \( \Sigma = [0..\sigma) \) with \( \sigma = n^{O(1)} \), the running time is \( O\left(\frac{n \log \sigma}{\sqrt{\log n}}\right) \) and the working space is \( O\left(\frac{n}{\log_\sigma n}\right) \).
4. **Implementation methods**:
   - The authors construct an index that quickly locates the leftmost occurrence of a substring. To do so, they introduce a new query type, the prefix range minimum query (prefix RMQ).
   - The index answers queries for substrings of the text specified by a position and a length, and it also handles patterns given explicitly.
   - Using this index, the authors construct the LZ77 factorization efficiently and thereby achieve sublinear running time.

In conclusion, the paper breaks the long-standing linear-time barrier for LZ77 factorization in the RAM model, a result of theoretical interest with potential practical relevance.
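The link between leftmost-occurrence queries and the factorization can be illustrated with a minimal Python sketch. It is not the paper's algorithm: `leftmost_occurrence` is a hypothetical, naive stand-in for the index (it simply calls `str.find`), and the phrase length is found by extending one letter at a time, so the sketch runs in superlinear time. It only shows that each phrase is determined by queries of the form $(j,\ell)$.

```python
def leftmost_occurrence(T, j, length):
    """Hypothetical naive stand-in for the index: returns the smallest i
    such that T[i:i+length] == T[j:j+length]. The paper's index answers
    this query far faster; here we just scan with str.find."""
    return T.find(T[j:j + length])


def lz77_factorize(T):
    """Greedy LZ77 factorization driven by leftmost-occurrence queries.

    Returns a list of phrases: (letter, 0) for the first occurrence of a
    letter, or (source, length) for a phrase copied from position `source`.
    Sources may overlap the phrase itself, as in standard LZ77.
    """
    phrases = []
    n = len(T)
    j = 0
    while j < n:
        length = 0
        source = None
        # Extend the phrase while T[j:j+length+1] also occurs at a position < j.
        while j + length < n:
            i = leftmost_occurrence(T, j, length + 1)
            if i >= j:          # no earlier occurrence: stop extending
                break
            length += 1
            source = i
        if length == 0:
            phrases.append((T[j], 0))   # first occurrence of this letter
            j += 1
        else:
            phrases.append((source, length))
            j += length
    return phrases


print(lz77_factorize("abaababa"))
# [('a', 0), ('b', 0), (0, 1), (0, 3), (1, 2)]
```

The input `abaababa` splits into the phrases `a`, `b`, `a`, `aba`, `ba`. Extending letter by letter is valid because if a prefix of the unprocessed suffix has an earlier occurrence, so does every shorter prefix.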