Abstract: Lempel-Ziv (LZ77) factorization is a fundamental problem in string processing: Greedily partition a given string $T$ from left to right into blocks (called phrases) so that each phrase is either the leftmost occurrence of a letter or the longest prefix of the unprocessed suffix that has another occurrence earlier in $T$. Due to numerous applications, LZ77 factorization is one of the most studied problems on strings. In the 47 years since its inception, several algorithms were developed for different models of computation, including parallel, GPU, external-memory, and quantum. Remarkably, however, the complexity of the most basic variant is still not settled: All existing algorithms in the RAM model run in $\Omega(n)$ time, which is a $\Theta(\log n)$ factor away from the lower bound of $\Omega(n/\log n)$ (following from the necessity to read the input, which takes $\Theta(n/\log n)$ space for $T\in\{0,1\}^{n}$). We present the first $o(n)$-time algorithm for LZ77 factorization, breaking the linear-time barrier present for nearly 50 years. For $T\in\{0,1\}^{n}$, our algorithm runs in $\mathcal{O}(n/\sqrt{\log n})=o(n)$ time and uses the optimal $\mathcal{O}(n/\log n)$ working space. Our algorithm generalizes to $\Sigma=[0..\sigma)$, where $\sigma=n^{\mathcal{O}(1)}$. The runtime and working space then become $\mathcal{O}((n\log\sigma)/\sqrt{\log n})$ and $\mathcal{O}(n/\log_{\sigma} n)$. To obtain our algorithm, we prove a more general result: For any constant $\epsilon\in(0,1)$ and $T\in[0..\sigma)^{n}$, in $\mathcal{O}((n\log\sigma)/\sqrt{\log n})$ time and using $\mathcal{O}(n/\log_{\sigma}n)$ space, we can construct an $\mathcal{O}(n/\log_{\sigma}n)$-size index that, given any $P=T[j..j+\ell)$ (represented as $(j,\ell)$), computes the leftmost occurrence of $P$ in $T$ in $\mathcal{O}(\log^{\epsilon}n)$ time. In other words, we solve the indexing/online variant of the LZ77 problem.
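To make the greedy definition in the abstract concrete, here is a minimal Python sketch that applies it literally. It is a naive, slow illustration and not the sublinear-time algorithm of the paper. The function name `lz77_factorize` is hypothetical, and the sketch assumes that an "earlier occurrence" only needs to start before the current position, so it may overlap the phrase being formed.

```python
def lz77_factorize(text: str) -> list[str]:
    """Naive LZ77 factorization straight from the definition (illustration only).

    Assumption: an earlier occurrence must START before the current
    position i, but is allowed to overlap the phrase itself.
    """
    phrases = []
    i, n = 0, len(text)
    while i < n:
        length = 0
        # Grow the candidate phrase while text[i:i+length+1] still occurs at
        # some start position < i. Restricting the search to
        # text[0:i+length] enforces exactly that: a match of length+1
        # characters inside that window must start at position i-1 or earlier.
        while i + length < n and text.find(text[i:i + length + 1], 0, i + length) != -1:
            length += 1
        # No earlier occurrence at all: the phrase is a single new letter.
        length = max(length, 1)
        phrases.append(text[i:i + length])
        i += length
    return phrases


if __name__ == "__main__":
    print(lz77_factorize("abaababa"))
```

Each step rescans the text with `str.find`, so the running time is far from the bounds discussed above; the sketch only pins down which phrases the factorization consists of.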

### What problem does this paper attempt to solve?

This paper addresses the computational complexity of Lempel-Ziv (LZ77) factorization, specifically whether the factorization can be computed in sublinear time.

1. **Background and problem definition**:
   - LZ77 factorization is a fundamental problem in string processing. It greedily divides a given string from left to right into blocks (called phrases). Each phrase is either a single character that appears for the first time or the longest prefix of the unprocessed part that already occurs earlier in the text.
   - LZ77 factorization is widely used in data compression (such as the zip, pdf, and png formats), repeated pattern detection, and compressed indexing.
2. **Limitations of existing algorithms**:
   - Although LZ77 factorization has been studied intensively over the past 47 years and a variety of efficient algorithms have been developed, all existing RAM-model algorithms require $ \Omega(n) $ time.
   - No previously known algorithm therefore beats linear time, even though the lower bound implied by reading a binary input is only $ \Omega(n/\log n) $. This leaves a $ \Theta(\log n) $ gap between the upper and lower bounds.
3. **Main contributions of the paper**:
   - The authors propose the first algorithm that computes the LZ77 factorization in $ o(n) $ time, breaking the linear-time barrier that has stood for nearly 50 years.
   - For the binary alphabet ($ \sigma = 2 $), the algorithm runs in $ O\left(\frac{n}{\sqrt{\log n}}\right) $ time and uses the optimal $ O\left(\frac{n}{\log n}\right) $ working space.
   - For a larger integer alphabet $ \Sigma = [0..\sigma) $ with $ \sigma = n^{O(1)} $, the running time is $ O\left(\frac{n \log \sigma}{\sqrt{\log n}}\right) $ and the working space is $ O\left(\frac{n}{\log_\sigma n}\right) $.
4. **Implementation methods**:
   - The authors construct an index that quickly locates the leftmost occurrence of a substring, introducing a new query type for this purpose: the prefix range minimum query (prefix RMQ).
   - The index supports fast queries for substrings at any position and of any length, and is also suitable for explicit pattern queries.
   - Using this index, the authors compute the LZ77 factorization efficiently and thereby achieve sublinear time.

In conclusion, this paper breaks the long-standing linear-time barrier for LZ77 factorization, a result of both theoretical and practical interest.
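The connection between leftmost-occurrence queries and the factorization can be illustrated with a small Python sketch. It is only one plausible way to use such a query and is not taken from the paper: `leftmost_occurrence` is a naive stand-in for the index (it simply calls `str.find`), and the binary search over the phrase length is an assumption made for illustration. Both function names are hypothetical.

```python
def leftmost_occurrence(text: str, j: int, length: int) -> int:
    """Naive stand-in for the index query: leftmost start of text[j:j+length].

    The paper's index answers this in O(log^eps n) time; this placeholder
    just scans the text.
    """
    return text.find(text[j:j + length])


def lz77_via_queries(text: str) -> list[tuple[int, int]]:
    """Compute LZ77 phrases as (start, length) pairs using only the query above."""
    phrases = []
    i, n = 0, len(text)
    while i < n:
        # "text[i:i+l] has an occurrence starting before i" is monotone in l
        # (true for short l, false for long l), so binary search applies.
        lo, hi = 0, n - i  # invariant: a prefix of length lo occurs earlier
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if leftmost_occurrence(text, i, mid) < i:
                lo = mid
            else:
                hi = mid - 1
        length = max(lo, 1)  # lo == 0 means a letter seen for the first time
        phrases.append((i, length))
        i += length
    return phrases


if __name__ == "__main__":
    print(lz77_via_queries("abaababa"))
```

The factorization loop touches the text only through `leftmost_occurrence`, so a faster index could be swapped in without changing the loop.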

Lempel-Ziv (LZ77) Factorization in Sublinear Time
