Abstract: Indexing highly repetitive texts --- such as genomic databases, software repositories and versioned text collections --- has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is $r$, the number of runs in their Burrows-Wheeler Transform (BWT). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used $O(r)$ space and was able to efficiently count the number of occurrences of a pattern of length $m$ in the text (in loglogarithmic time per pattern symbol, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of $r$. Since then, a number of other indexes with space bounded by other measures of repetitiveness --- the number of phrases in the Lempel-Ziv parse, the size of the smallest grammar generating the text, the size of the smallest automaton recognizing the text factors --- have been proposed for efficiently locating, but not directly counting, the occurrences of a pattern. In this paper we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the $occ$ occurrences efficiently within $O(r)$ space (in loglogarithmic time each), and reaching optimal time $O(m+occ)$ within $O(r\log(n/r))$ space, on a RAM machine of $w=\Omega(\log n)$ bits. Within $O(r\log (n/r))$ space, our index can also count in optimal time $O(m)$. Raising the space to $O(r w\log_\sigma(n/r))$, we support count and locate in $O(m\log(\sigma)/w)$ and $O(m\log(\sigma)/w+occ)$ time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using $O(r\log(n/r))$ space that replaces the text and extracts any text substring of length $\ell$ in almost-optimal time $O(\log(n/r)+\ell\log(\sigma)/w)$. (...continues...)
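
To make the measure $r$ concrete, the following minimal sketch computes the BWT of a short string and counts its runs of equal symbols. It is an illustration only, not the index described in the abstract: the function names `bwt` and `count_runs` are hypothetical, the terminator `$` is an assumed sentinel that sorts before every text symbol, and the transform is built by naively sorting all rotations, which is practical only for small inputs.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Return the Burrows-Wheeler Transform of text, built naively.

    Assumes `sentinel` does not occur in `text` and compares smaller
    than every symbol of `text` (true for "$" against ASCII letters).
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    # Sort all cyclic rotations; the BWT is the last column of that matrix.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s: str) -> int:
    """Count maximal runs of equal consecutive symbols in s."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


if __name__ == "__main__":
    for text in ("banana", "abc" * 8):
        transformed = bwt(text)
        print(text, transformed, count_runs(transformed))
```

On the repetitive input `"abc" * 8`, the number of runs stays small while the text length grows with the number of repetitions, which is the behaviour that makes $r$ a useful compressibility measure for repetitive collections.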

Bass: Approximate Search on Large String Databases

Design of Fast Multiple String Searching Based on Improved Prefix Tree

INSTRUCT: Space-Efficient Structure for Indexing and Complete Query Management of String Databases

Efficient Metric Indexing for Similarity Search

Optimal-Time Text Indexing in BWT-runs Bounded Space

Augmented Keyword Search on Spatial Entity Databases

A new problem in string searching

Two simple full-text indexes based on the suffix array

Text Indexing for Long Patterns using Locally Consistent Anchors

Processing Long Queries Against Short Text

Optimizing substructure search: a novel approach for efficient querying in large chemical databases

High-performance String Matching Algorithm for Large Scale String Set

Efficient Similarity Search for Tree-Structured Data

LITS: An Optimized Learned Index for Strings

LITS: An Optimized Learned Index for Strings (An Extended Version)

Effective Indices for Efficient Approximate String Search and Similarity Join

DRESS: dimensionality reduction for efficient sequence search

Bounding the Last Mile: Efficient Learned String Indexing

Assessment of approximate string matching in a biomedical text retrieval problem

Robust Quick String Matching Algorithm for Network Security

A Collaborative Retrieval System - Full Text Base and Database