Abstract: The Burrows-Wheeler Transform (BWT) moves characters with similar contexts in a text together, where a character's context consists of the characters immediately following it. We say that a property has contextual locality if characters with similar contexts tend to have the same or similar values ("tags") of that property. We argue that if we consider a repetitive text and such a property and the tags in their characters' BWT order, then the resulting string -- the text and property's *tag array* -- will be run-length compressible either directly or after some minor manipulation.
What problem does this paper attempt to address?
The paper asks how the Burrows-Wheeler Transform (BWT) turns a property with contextual locality into a compressible string. Specifically, the question is whether the "tag array" can be compressed efficiently, where the tag array is obtained by taking a repetitive text and listing one property of its characters (such as species tags or column numbers) in BWT order.
### Specific problems include:
1. **The relationship between contextual locality and text locality**:
- The paper discusses contextual locality: characters with similar contexts tend to have the same or similar property values ("tags"). It examines how these tags behave when listed in BWT order and argues that they usually form compressible strings.
2. **Characteristics of periodic texts**:
- Work by Mantaci, Restivo and Sciortino shows that the number of runs in the BWT of a periodic text is related to the period length of that text. This is because the BWT converts contextual locality into text locality: it gathers characters with similar contexts together.
- The converse (from text locality to contextual locality) does not always hold, unless we take into account which copy of a repeated substring each character comes from.
3. **Extension to approximately periodic texts**:
- In practical applications, many strings are not strictly periodic but approximately periodic (such as the human genome). Therefore, it is very important to study the BWT properties of these approximately periodic texts and their impact on tag arrays.
4. **Compressibility of tag arrays**:
- A tag array is the result of listing a property of a text's characters (such as species tags or column numbers) in BWT order. The paper explores whether tag arrays can be compressed efficiently by run-length compression or other methods.
5. **Applications in bioinformatics**:
- The paper also discusses applications of tag arrays in bioinformatics, especially the classification of sequences against multi-species pan-genomes. When each character is tagged with the species whose genome it comes from, the tag array can help identify and classify sequences from different species.
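
The tag-array idea from item 4 can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for this example and is not taken from the paper: the two toy "genomes", the species labels `A` and `B`, the separator `#`, the terminator `$`, and the helper names `suffix_array`, `tag_array` and `count_runs`. It uses a naive suffix sort, which is only suitable for toy inputs.

```python
from itertools import groupby


def suffix_array(text):
    """Naive suffix array: sort suffix start positions by the suffix text."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def tag_array(text, tags):
    """List the tags of the characters in BWT order.

    tags[j] is the tag of text[j]. The BWT character at rank r is the one
    preceding the r-th smallest suffix, i.e. text[SA[r] - 1]; index -1
    wraps around to the final terminator.
    """
    return [tags[i - 1] for i in suffix_array(text)]


def count_runs(seq):
    """Number of maximal runs of equal consecutive values."""
    return sum(1 for _ in groupby(seq))


# Hypothetical toy collection: two repetitive "genomes" from two species.
species_a = "GATTACA" * 4
species_b = "CCGT" * 6
text = species_a + "#" + species_b + "$"
# Each character (including the symbol that ends its genome) gets a species tag.
tags = ["A"] * (len(species_a) + 1) + ["B"] * (len(species_b) + 1)

ta = tag_array(text, tags)
print("characters:", len(ta), "runs in tag array:", count_runs(ta))
```

A run-length encoding stores one (tag, length) pair per run, so the fewer runs there are relative to the number of characters, the better the tag array compresses.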
### Formula representation
- **Burrows-Wheeler Transform (BWT)**: Given a string \(T = t_0t_1\ldots t_{n - 1}\), the BWT is obtained by sorting all rotations \(t_it_{i + 1}\ldots t_{n - 1}t_0\ldots t_{i - 1}\) lexicographically and taking the last character of each rotation, that is, the character \(t_{i - 1}\) that immediately precedes \(t_i\) in \(T\).
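- **Longest Common Prefix (LCP)**: \(LCP[i]\) is the length of the longest common prefix of the suffixes \(T[SA[i - 1]:]\) and \(T[SA[i]:]\), that is, of the two suffixes at ranks \(i - 1\) and \(i\) in lexicographic order, where \(SA\) is the suffix array of \(T\).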
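- **Interleaved Longest Common Prefix (ILCP)**: For a collection of similar documents, the LCP array of each document is computed separately, and the resulting values are interleaved in the order in which the documents' suffixes appear in the suffix array of the whole collection.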
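
As a small illustration of the BWT and LCP definitions, here is a Python sketch that builds the BWT by sorting all rotations and taking the last character of each, and builds the LCP array by comparing lexicographically adjacent suffixes. It is a naive quadratic-time sketch for toy inputs. The function names `bwt` and `lcp_array`, and the assumption that the text ends with a unique terminator `$` smaller than every other character, are choices made for this example.

```python
def bwt(text):
    """BWT via sorted rotations: take the last character of each rotation."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)


def lcp_array(text):
    """LCP[r] = longest common prefix length of the suffixes at ranks r-1 and r.

    LCP[0] is set to 0 by convention, since rank 0 has no predecessor.
    """
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])  # naive suffix array

    def lcp(a, b):
        # Count matching characters of the suffixes starting at a and b.
        k = 0
        while a + k < n and b + k < n and text[a + k] == text[b + k]:
            k += 1
        return k

    return [0] + [lcp(sa[r - 1], sa[r]) for r in range(1, n)]


print(bwt("banana$"))        # annb$aa
print(lcp_array("banana$"))  # [0, 0, 1, 3, 0, 0, 2]
```

In the `banana$` example the two `n` characters and two of the `a` characters end up adjacent in the BWT because the characters that follow them in the text are similar, which is the grouping by context that the definitions describe.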
### Conclusion
The core objective of the paper is to understand and exploit how the BWT behaves on properties with contextual locality, so that such properties can be compressed and queried efficiently. This contributes to theory and also has potential applications in fields such as bioinformatics.