Efficient Pattern Matching in Elastic-Degenerate Strings

Costas Iliopoulos, Ritu Kundu, Solon Pissis
DOI: https://doi.org/10.48550/arXiv.1610.08111
2016-10-26
Abstract: In this paper, we extend the notion of gapped strings to elastic-degenerate strings. An elastic-degenerate string can be seen as an ordered collection of k > 1 seeds (substrings/subpatterns) interleaved by elastic-degenerate symbols such that each elastic-degenerate symbol corresponds to a set of two or more variable-length strings. Here, we present an algorithm for solving the pattern matching problem with a (solid) pattern and an elastic-degenerate text, running in \(O(N+\alpha\gamma nm)\) time, where m is the length of the given pattern; n and N are the length and total size of the given elastic-degenerate text, respectively; and \(\alpha\) and \(\gamma\) are small constants, respectively representing the maximum number of strings in any elastic-degenerate symbol of the text and the largest number of elastic-degenerate symbols spanned by any occurrence of the pattern in the text. The space used by the algorithm is linear in the size of the input for a constant number of elastic-degenerate symbols in the text; \(\alpha\) and \(\gamma\) are so small in real applications that the algorithm is expected to work very efficiently in practice.
Data Structures and Algorithms
What problem does this paper attempt to address?
This paper addresses pattern matching in elastic-degenerate strings. Specifically, given a solid pattern and an elastic-degenerate text, the goal is to find all occurrences of the pattern in the text.

### Background and Motivation

In fields such as genomics, processing sequence data that contains uncertainties is an important problem. A traditional linear reference genome cannot fully capture the variation present at the population level, so new representations are required to describe these uncertainties. Elastic-degenerate strings are one such representation, combining the concepts of gapped patterns and degenerate strings.

### Definition of Elastic-Degenerate Strings

- **Seed**: A seed \(S\) is a possibly empty string, that is, \(S\in\Sigma^*\).
- **Elastic-Degenerate Symbol**: An elastic-degenerate symbol \(\xi\) is a non-empty set of strings, that is, \(\xi\subset\Sigma^*\) and \(\xi\neq\emptyset\). Each \(\xi\) can be represented as:
\[
\xi = \begin{bmatrix} E_1\\ E_2\\ \vdots\\ E_{|\xi|} \end{bmatrix}
\]
where each \(E_i\) is a string, and the strings in one symbol may have different lengths.
- **Elastic-Degenerate String**: An elastic-degenerate string \(\hat{X}\) is a sequence composed of seeds and elastic-degenerate symbols, in the form of:
\[
\hat{X}=S_1\xi_1S_2\xi_2S_3\ldots\xi_{k-1}S_k
\]
where \(S_i\) are seeds and \(\xi_i\) are elastic-degenerate symbols.

### Problem Definition

Given a pattern \(P\) of length \(m\) and an elastic-degenerate text \(\hat{T}\) of length \(n\) and total size \(N\), the goal is to find all occurrences of the pattern \(P\) in the text \(\hat{T}\).

### Algorithm Overview

The paper proposes an algorithm with a time complexity of \(O(N+\alpha\gamma nm)\), where:

- \(m\) is the length of the pattern.
- \(n\) and \(N\) are the length and total size of the elastic-degenerate text, respectively.
- \(\alpha\) is the maximum number of strings in any elastic-degenerate symbol in the text.
- \(\gamma\) is the largest number of elastic-degenerate symbols spanned by any occurrence of the pattern in the text.

### Algorithm Steps

1. **Pre-processing Phase**:
   - Compute the failure function of the pattern \(P\) for the KMP algorithm.
   - Build the generalized suffix tree for each seed \(S_i\) and each elastic-degenerate symbol \(\xi_i\), and pre-process these trees to support constant-time lowest common ancestor (LCA) queries.
2. **Search Phase**:
   - Use the KMP algorithm to search for the pattern in the text, handling seeds and elastic-degenerate symbols.
   - Process type 1 occurrences (starting from a fixed position) and type 2 occurrences (starting from an elastic-degenerate position) separately.
   - Use LCA queries to extend the marked tails and recursively check the subsequent seeds and elastic-degenerate symbols.
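
To make the problem statement concrete, the following is a minimal Python sketch of matching a solid pattern against an elastic-degenerate text. It is not the paper's algorithm: it replaces the suffix trees and LCA queries with direct string comparisons, so it does not achieve the stated time bound. The representation (a list of segments, where a seed is a one-element list and an elastic-degenerate symbol is a list of alternatives), the names `failure_function`, `scan`, and `ed_match`, and the choice to report the index of the segment in which each occurrence ends are all assumptions made for illustration.

```python
from typing import List, Set, Tuple


def failure_function(p: str) -> List[int]:
    """KMP failure function: fail[i] is the length of the longest
    proper border (prefix that is also a suffix) of p[:i + 1]."""
    fail = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = fail[k - 1]
        if p[i] == p[k]:
            k += 1
        fail[i] = k
    return fail


def scan(p: str, fail: List[int], s: str) -> Tuple[bool, Set[int]]:
    """Run KMP over s alone. Return whether p occurs inside s, and the
    set of lengths j (0 < j < len(p)) such that p[:j] is a suffix of s."""
    m, k, found = len(p), 0, False
    for c in s:
        while k > 0 and (k == m or c != p[k]):
            k = fail[k - 1]
        if c == p[k]:
            k += 1
        if k == m:
            found = True
    borders: Set[int] = set()
    while k > 0:  # walk the border chain of the final KMP state
        if k < m:
            borders.add(k)
        k = fail[k - 1]
    return found, borders


def ed_match(pattern: str, text: List[List[str]]) -> List[int]:
    """Return indices of segments in which some occurrence of pattern ends.
    Hypothetical helper: a simplified stand-in for the search phase."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    fail = failure_function(pattern)
    active: Set[int] = set()  # prefix lengths matched up to the current boundary
    hits: List[int] = []
    for idx, segment in enumerate(text):
        next_active: Set[int] = set()
        hit = False
        for alt in segment:
            # Occurrences and partial matches that start inside this alternative.
            found, borders = scan(pattern, fail, alt)
            hit = hit or found
            next_active |= borders
            # Partial matches that started in earlier segments.
            for j in active:
                rest = pattern[j:]
                if len(alt) < len(rest):
                    if rest.startswith(alt):  # alternative consumed entirely
                        next_active.add(j + len(alt))
                elif alt.startswith(rest):    # pattern completes here
                    hit = True
        if hit:
            hits.append(idx)
        active = next_active
    return hits


# Seed "AC", symbol {"T", "", "GA"}, seed "CG".
example_text = [["AC"], ["T", "", "GA"], ["CG"]]
print(ed_match("ACG", example_text))
```

In this example the text spells "ACTCG", "ACCG", or "ACGACG" depending on which alternative is chosen, so the pattern "ACG" ends once inside the elastic-degenerate symbol and once inside the final seed. Keeping a set of active prefix lengths at each segment boundary is one simple way to carry partial matches across symbols; the paper's approach of extending marked tails with LCA queries serves a similar purpose with better guarantees.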