A Matching Algorithm in PMWL Based on CluTree

Yingling Liu, Xindong Wu, Xue-gang Hu, Jun Gao
DOI: https://doi.org/10.1007/s00354-014-0201-3
2014-01-01
New Generation Computing
Abstract: Pattern matching with wildcards and length constraints (PMWL) is a complex problem with important applications in bioinformatics, network security and information retrieval. Existing algorithms use the traditional left-most strategy when selecting among multiple candidate matching positions, which leads to incomplete final matching results. This paper presents a new data structure, CluTree, and a new matching algorithm, RBCT, based on it. RBCT first builds a cluster of trees with red and black nodes from a pattern P and a text T, called a CluTree, and then uses the sharing degree, correlation degree and mixed information entropy of each node in the CluTree for path selection and dynamic pruning. RBCT traverses the CluTree and finds more occurrences than existing algorithms under the one-off condition, in linear time. Theoretical analysis and experimental results show that RBCT outperforms its peers in retrieval precision and matching efficiency.
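
To make the incompleteness of the left-most strategy concrete, the following is a minimal Python sketch. It is not the RBCT algorithm or the CluTree structure from the paper. It assumes a simplified PMWL setting in which every pair of consecutive pattern characters shares the same wildcard gap range [min_gap, max_gap] and the global length constraint is omitted. The function names (candidate_occurrences, leftmost_one_off, best_one_off) and the toy text are hypothetical, chosen only for illustration. The sketch compares a greedy left-most selection against an exhaustive search under the one-off condition, where each text position may be used at most once.

```python
from itertools import combinations


def candidate_occurrences(text, pattern, min_gap, max_gap):
    """List every occurrence as a tuple of text positions, left-most first.

    Assumption: between consecutive pattern characters there must be
    between min_gap and max_gap wildcard positions (inclusive).
    """
    results = []

    def extend(positions):
        if len(positions) == len(pattern):
            results.append(tuple(positions))
            return
        prev = positions[-1]
        want = pattern[len(positions)]
        last = min(prev + max_gap + 1, len(text) - 1)
        for nxt in range(prev + min_gap + 1, last + 1):
            if text[nxt] == want:
                extend(positions + [nxt])

    for start, ch in enumerate(text):
        if ch == pattern[0]:
            extend([start])
    return results


def leftmost_one_off(candidates):
    """Greedy baseline: accept each candidate, in left-most order,
    if it reuses no text position already taken (one-off condition)."""
    used, chosen = set(), []
    for occ in candidates:
        if used.isdisjoint(occ):
            chosen.append(occ)
            used.update(occ)
    return chosen


def best_one_off(candidates):
    """Exhaustive search for a largest position-disjoint set of occurrences.
    Exponential in the number of candidates; only for tiny inputs."""
    for size in range(len(candidates), 0, -1):
        for combo in combinations(candidates, size):
            positions = [p for occ in combo for p in occ]
            if len(positions) == len(set(positions)):
                return list(combo)
    return []


cands = candidate_occurrences("acbcaabca", "aba", min_gap=1, max_gap=2)
print(leftmost_one_off(cands))  # [(0, 2, 4)]
print(best_one_off(cands))      # [(0, 2, 5), (4, 6, 8)]
```

In this toy input the greedy choice uses position 4 as the last character of the first occurrence, which blocks the only occurrence that starts at position 4. Choosing position 5 instead leaves room for a second occurrence. This is the kind of loss that a smarter path selection rule is meant to avoid.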