NETASPNO: Approximate Strict Pattern Matching under Nonoverlapping Condition

Youxi Wu,Shasha Li,Jingyu Liu,Lei Guo,Xindong Wu
DOI: https://doi.org/10.1109/access.2018.2832209
IF: 3.9
2018-01-01
IEEE Access
Abstract:In pattern matching, a gap constraint is a more flexible wildcard than traditional wildcards "?" and "*". Pattern matching with gap constraints is more difficult to handle and fulfills users enquiries more easily. Pattern matching with gap constraints has therefore been carried out in numerous research works, such as music information retrieval, searching protein sites, and sequence pattern mining. Strict pattern matching under a nonoverlapping condition, as a type of pattern matching with gap constraints, is a key issue of sequence pattern mining with gap constraints since it can be used to compute the frequency of a pattern. Exact matching limits the flexibility of the match to some extent since it requires each character to be matched exactly. We therefore address approximate strict pattern matching under the nonoverlapping constraints (ASPNO) and propose an effective algorithm, named NETtree for ASPNO (NETASPNO), which first transforms the problem into a Nettree data structure, an extensive tree structure. To find the nonoverlapping occurrences effectively, we propose the concept of number of roots paths with distance constraints (NRPDC) which indicates the number of path from a node to the roots with distance d and can be used to delete useless parent-child relationships and useless nodes. We iteratively recalculate the NRPDCs of each node on the subnettree with the rightmost root. Then we can get a path from the rightmost leaf to its rightmost root without using the backtracking strategy. NETASPNO therefore iteratively gets the rightmost root-leaf-path and prunes the path on the Nettree. Extensive experimental results demonstrate that NETASPNO has better performance than the other competitive algorithms.
What problem does this paper attempt to address?