Frequency-Constrained Substring Complexity
Solon P. Pissis, Michael Shekelyan, Chang Liu, Grigorios Loukides
DOI: https://doi.org/10.1007/978-3-031-43980-3_28
2023-01-01
Abstract: We introduce the notion of frequency-constrained substring complexity. For any finite string, it counts the distinct substrings of the string per length and frequency class. For a string x of length n and a partition of [n] into tau intervals, I = I_1, ..., I_tau, the frequency-constrained substring complexity of x is the function f_{x,I}(i, j) that maps i, j to the number of distinct substrings of length i of x occurring at least alpha_j and at most beta_j times in x, where I_j = [alpha_j, beta_j]. We extend this notion as follows. For a string x, a dictionary D of d strings (documents), and a partition of [d] into tau intervals I_1, ..., I_tau, we define a 2D array S = S[1..|x|, 1..tau] as follows: S[i, j] is the number of distinct substrings of length i of x occurring in at least alpha_j and at most beta_j documents, where I_j = [alpha_j, beta_j]. Array S can thus be seen as the distribution of the substring complexity of x into tau document frequency classes. We show that after a linear-time preprocessing of D, for any x and any partition of [d] into tau intervals given online, array S can be computed in near-optimal O(|x| tau log log d) time.
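To make the two definitions concrete, the following is a minimal brute-force sketch in Python. It is not the algorithm of the paper: it does no preprocessing of D and runs in polynomial rather than near-optimal time. The function names, the 0-based table layout, and the representation of each interval as an (alpha, beta) pair are assumptions made for illustration only.

```python
from collections import Counter


def frequency_constrained_complexity(x, intervals):
    """Brute-force table for the single-string definition.

    table[i - 1][j] is the number of distinct substrings of length i of x
    whose number of occurrences in x (overlaps included) lies in
    intervals[j] = (alpha, beta). Illustrative only, not the paper's method.
    """
    n = len(x)
    table = []
    for i in range(1, n + 1):
        # Count occurrences of every length-i substring over all start positions.
        counts = Counter(x[s:s + i] for s in range(n - i + 1))
        row = [sum(1 for c in counts.values() if a <= c <= b)
               for (a, b) in intervals]
        table.append(row)
    return table


def document_frequency_array(x, docs, intervals):
    """Brute-force array S for the dictionary definition.

    S[i - 1][j] is the number of distinct substrings of length i of x that
    occur in at least alpha and at most beta documents of docs, where
    intervals[j] = (alpha, beta). Substrings occurring in no document fall
    in no class, because the intervals partition [d] = {1, ..., d}.
    """
    n = len(x)
    S = []
    for i in range(1, n + 1):
        distinct = {x[s:s + i] for s in range(n - i + 1)}
        row = [0] * len(intervals)
        for u in distinct:
            # Document frequency: number of documents containing u.
            df = sum(1 for doc in docs if u in doc)
            for j, (a, b) in enumerate(intervals):
                if a <= df <= b:
                    row[j] += 1
                    break  # the intervals are disjoint
        S.append(row)
    return S
```

For example, with x = "abab" and the partition [1, 1], [2, 4] of [4], the first function returns [[0, 2], [1, 1], [2, 0], [1, 0]]: both letters occur twice, "ab" occurs twice and "ba" once, and every longer substring occurs once. With the hypothetical dictionary ["ab", "ba", "abc"] and the partition [1, 1], [2, 3] of [3], the second function returns [[0, 2], [1, 1], [0, 0], [0, 0]].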