Relating Left and Right Extensions of Maximal Repeats

Shunsuke Inenaga, Dmitry Kosolobov
2024-10-21
Abstract: The compact directed acyclic word graph (CDAWG) of a string $T$ is an index occupying $O(\mathsf{e})$ space, where $\mathsf{e}$ is the number of right extensions of maximal repeats in $T$. For highly repetitive datasets, the measure $\mathsf{e}$ typically is small compared to the length $n$ of $T$ and, thus, the CDAWG serves as a compressed index. Unlike other compressibility measures (as LZ77, string attractors, BWT runs, etc.), $\mathsf{e}$ is very unstable with respect to reversals: the CDAWG of the reversed string $\overset{{}_{\leftarrow}}{T} = T[n] \cdots T[2] T[1]$ has size $O(\overset{{}_{\leftarrow}}{\mathsf{e}})$, where $\overset{{}_{\leftarrow}}{\mathsf{e}}$ is the number of left extensions of maximal repeats in $T$, and there are strings $T$ with $\frac{\overset{{}_{\leftarrow}}{\mathsf{e}}}{\mathsf{e}} \in \Omega(\sqrt{n})$. In this note, we prove that this lower bound is tight: $\frac{\overset{{}_{\leftarrow}}{\mathsf{e}}}{\mathsf{e}} \in O(\sqrt{n})$. Furthermore, given the alphabet size $\sigma$, we establish the alphabet-dependent bound $\frac{\overset{{}_{\leftarrow}}{\mathsf{e}}}{\mathsf{e}} \le \min\{\frac{2n}{\sigma}, \sigma\}$ and we show that it is asymptotically tight.
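The two upper bounds stated in the abstract are consistent with each other. The following short case analysis is elementary arithmetic, not a reproduction of the paper's proof. It suggests that the alphabet-dependent bound by itself already yields an $O(\sqrt{n})$ bound for every alphabet size $\sigma \ge 1$:

```latex
\[
  \min\Bigl\{\frac{2n}{\sigma},\, \sigma\Bigr\} \le \sqrt{2n}
  \qquad \text{for every } \sigma \ge 1 ,
\]
% Case 1: \sigma \le \sqrt{2n}  gives  \min\{2n/\sigma, \sigma\} \le \sigma \le \sqrt{2n}.
% Case 2: \sigma >  \sqrt{2n}  gives  \min\{2n/\sigma, \sigma\} \le 2n/\sigma < \sqrt{2n}.
\[
  \frac{\overset{{}_{\leftarrow}}{\mathsf{e}}}{\mathsf{e}}
  \;\le\; \min\Bigl\{\frac{2n}{\sigma},\, \sigma\Bigr\}
  \;\le\; \sqrt{2n} \;\in\; O(\sqrt{n}) .
\]
```

The two arguments of the minimum coincide at $\sigma = \sqrt{2n}$, which is where the alphabet-dependent bound is largest.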
Data Structures and Algorithms
What problem does this paper attempt to address?
The paper studies the space complexity of compact directed acyclic word graphs (CDAWGs) in string processing, and in particular the relationship between the numbers of left and right extensions of maximal repeats. Specifically, it focuses on:

1. **Space complexity of the CDAWG**: The CDAWG is a data structure for string indexing that occupies \(O(e)\) space, where \(e\) is the number of right extensions of maximal repeats in the string \(T\). For highly repetitive datasets, \(e\) is typically small compared to the string length \(n\), so the CDAWG can serve as a compressed index.
2. **Effect of reversing the string**: Unlike other compressibility measures (such as LZ77, string attractors, BWT runs, etc.), \(e\) is very unstable under string reversal. The CDAWG of the reversed string \(\overleftarrow{T} = T[n] \cdots T[2] T[1]\) has size \(O(\overleftarrow{e})\), where \(\overleftarrow{e}\) is the number of left extensions of maximal repeats in \(T\). There are strings \(T\) such that \(\frac{\overleftarrow{e}}{e} \in \Omega(\sqrt{n})\).
3. **Proof of upper and lower bounds**: The main contribution of the paper is a proof that this lower bound is tight, that is, \(\frac{\overleftarrow{e}}{e} \in O(\sqrt{n})\). In addition, given the alphabet size \(\sigma\), the paper establishes the alphabet-dependent upper bound \(\frac{\overleftarrow{e}}{e} \le \min\left\{\frac{2n}{\sigma}, \sigma\right\}\) and shows that it is asymptotically tight.

### Summary of the core problems in the paper:

- **Problem description**: Study the relationship between the numbers of left and right extensions of maximal repeats, and in particular how the CDAWG size changes when the string is reversed.
- **Objective**: Prove upper and lower bounds on the ratio \(\frac{\overleftarrow{e}}{e}\), and show that these bounds are tight.
- **Significance**: The space usage of some CDAWG-based data structures can be improved from \(O(e + \overleftarrow{e})\) to \(O(e)\), which significantly reduces their space requirements.

### Key results:

- **Tight upper and lower bounds**: The paper proves that \(\frac{\overleftarrow{e}}{e} \in O(\sqrt{n})\) and \(\frac{\overleftarrow{e}}{e} \le \min\left\{\frac{2n}{\sigma}, \sigma\right\}\).
- **Example construction**: The paper gives constructions of strings that attain these bounds.

These results help in understanding and optimizing CDAWG-based string indexes, especially their space efficiency on highly repetitive data.
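To make the two measures concrete, the numbers of right and left extensions of maximal repeats can be computed by brute force on small strings. The sketch below is illustrative only and is not the paper's algorithm. The function name `count_extensions` is hypothetical. It also assumes a particular convention: no sentinel characters are added, the empty string counts as a maximal repeat, and a substring is a maximal repeat if it occurs at least twice and every one-character extension on either side occurs strictly fewer times. Under a different convention (for example, with sentinels) the absolute counts may differ.

```python
from collections import defaultdict


def count_extensions(text):
    """Return (e_right, e_left) for `text` by brute force (roughly cubic time).

    Assumed convention (not taken from the paper): no sentinels, and the
    empty string is treated as a maximal repeat.
    """
    n = len(text)

    # occ[w] = number of (possibly overlapping) occurrences of substring w.
    # The empty string is counted once per position, i.e. n + 1 times.
    occ = defaultdict(int)
    for i in range(n + 1):
        for j in range(i, n + 1):
            occ[text[i:j]] += 1

    # right[w] = characters b such that w + b occurs in text;
    # left[w]  = characters a such that a + w occurs in text.
    right = defaultdict(set)
    left = defaultdict(set)
    for w in list(occ):
        if w:
            right[w[:-1]].add(w[-1])
            left[w[1:]].add(w[0])

    e_right = 0
    e_left = 0
    for w, count in list(occ.items()):
        if count < 2:
            continue  # not a repeat
        # Maximal: every one-character extension occurs strictly fewer times.
        right_maximal = all(occ[w + b] < count for b in right[w])
        left_maximal = all(occ[a + w] < count for a in left[w])
        if right_maximal and left_maximal:
            e_right += len(right[w])
            e_left += len(left[w])
    return e_right, e_left


if __name__ == "__main__":
    for s in ["aab", "baa", "abab"]:
        print(s, count_extensions(s))
```

Under this convention, `count_extensions("aab")` returns `(4, 3)` and `count_extensions("baa")` returns `(3, 4)`: reversing the string swaps the two counts, which is the asymmetry the paper quantifies.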