SIAM Journal on Computing, Volume 53, Issue 5, Pages 1524-1577, October 2024.

Abstract: We consider several types of internal queries, that is, questions about fragments of a given text $T$ specified in constant space by their locations in $T$. Our main result is an optimal data structure for internal pattern matching (IPM) queries, which, given two fragments $x$ and $y$, ask for a representation of all fragments contained in $y$ and matching $x$ exactly. This problem can be viewed as an internal version of the fundamental exact pattern matching problem: We are looking for exact occurrences of one substring of $T$ within another substring of $T$. Our data structure answers IPM queries in time proportional to the quotient $|y|/|x|$ of the fragments' lengths, which is required due to the worst-case information content of the output. If $T$ is a text of length $n$ over an integer alphabet of size $\sigma$, then our data structure occupies $O(n/\log_\sigma n)$ machine words (that is, $O(n \log \sigma)$ bits) and admits an $O(n/\log_\sigma n)$-time construction algorithm. We also show how to use IPM queries for answering internal queries corresponding to other classic string processing problems. Among others, we derive optimal data structures reporting the periods of a fragment and testing the cyclic equivalence of two fragments. Since the publication of the conference version of this paper [Kociumaka et al., Internal pattern matching queries in a text and applications, SODA 2015], IPM queries have found numerous further applications, following the path paved by the classic longest common extension (LCE) queries of Landau and Vishkin [J. Comput. System Sci., 37 (1988), pp. 63–78].
In particular, IPM queries have been implemented in grammar-compressed and dynamic settings and, along with LCE queries, constitute elementary operations of the PILLAR model, developed by Charalampopoulos, Kociumaka, and Wellnitz [Faster approximate pattern matching: A unified approach, FOCS 2020] to design approximate pattern matching algorithms that work in multiple settings. All our algorithms are deterministic, whereas the data structure in the conference version of the paper only admits a randomized construction in $O(n)$ expected time. To achieve this, we provide a novel construction of string synchronizing sets of Kempa and Kociumaka [String synchronizing sets: Sublinear-time BWT construction and optimal LCE data structure, STOC 2019]. Our method, based on a new restricted version of the recompression technique of Jeż [J. ACM, 63 (2016), pp. 4:1–4:51], yields a hierarchy of $O(\log n)$ string synchronizing sets covering the whole spectrum of the fragments' lengths.
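
To make the semantics of the three internal queries named in the abstract concrete, the following Python sketch answers each of them by brute force. It is only an illustration of what the queries ask for: the function names and the half-open `(start, end)` encoding of fragments are assumptions made here, not the paper's notation, and these naive scans come nowhere near the time and space bounds of the paper's data structure.

```python
def ipm_naive(text, x, y):
    """Naive internal pattern matching (IPM) query.

    x and y are half-open (start, end) pairs naming the fragments
    text[start:end] (a hypothetical encoding chosen for this sketch).
    Returns the start positions, in text, of all fragments contained
    in y that match x exactly.
    """
    xs, xe = x
    ys, ye = y
    pattern = text[xs:xe]
    m = len(pattern)
    # Try every alignment of the pattern that fits inside fragment y.
    return [i for i in range(ys, ye - m + 1) if text[i:i + m] == pattern]


def periods_naive(text, x):
    """Return all periods p (1 <= p <= length) of the fragment x."""
    frag = text[x[0]:x[1]]
    n = len(frag)
    # p is a period iff the fragment agrees with itself shifted by p.
    return [p for p in range(1, n + 1) if frag[p:] == frag[:n - p]]


def cyclically_equivalent_naive(text, x, y):
    """Test whether fragments x and y are rotations of each other."""
    a = text[x[0]:x[1]]
    b = text[y[0]:y[1]]
    # b is a rotation of a iff the lengths match and b occurs in a + a.
    return len(a) == len(b) and b in a + a


text = "abababab"
print(ipm_naive(text, (0, 2), (1, 8)))
print(periods_naive(text, (0, 7)))
print(cyclically_equivalent_naive(text, (0, 2), (1, 3)))
```

In the first call the occurrences of `ab` inside `bababab` start at positions 2, 4, and 6 of the text, an arithmetic progression. Regularity of this kind is what allows a compact representation of the output whose size depends on the quotient of the fragments' lengths rather than on the number of occurrences.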

Compressed Indexing for Consecutive Occurrences

Gapped String Indexing in Subquadratic Space and Sublinear Query Time

Sorted Consecutive Occurrence Queries in Substrings

The CDAWG Index and Pattern Matching on Grammar-Compressed Strings

Constant-delay enumeration for SLP-compressed documents

Optimal-Time Text Indexing in BWT-runs Bounded Space

Deterministic Indexing for Packed Strings

Text Indexing for Long Patterns using Locally Consistent Anchors

String Indexing for Patterns with Wildcards

Fingerprints in Compressed Strings

Internal Pattern Matching Queries in a Text and Applications

Collapsing the Hierarchy of Compressed Data Structures: Suffix Arrays in Optimal Compressed Space

Space-Efficient Indexes for Uncertain Strings

Linear Time Construction of Cover Suffix Tree and Applications

Generalized Straight-Line Programs

FM-Indexing Grammars Induced by Suffix Sorting for Long Patterns

Rank, select and access in grammar-compressed strings

Near-Optimal Search Time in $δ$-Optimal Space, and Vice Versa

Space-efficient SLP Encoding for $O(\log N)$-time Random Access

Iterated Straight-Line Programs

Partial Data Compression and Text Indexing via Optimal Suffix Multi-Selection