Abstract: I present efficient data mining algorithms for knowledge discovery on two types of emerging large-scale, sequence-based scientific datasets: (1) static sequence data generated from Single-Nucleotide Polymorphism (SNP) diversity arrays for genomic studies, and (2) dynamic sequence data collected in streaming and sensor network systems for environmental studies. The massive, noisy nature of SNP arrays and the distributed, online nature of sensor network data raise challenges for knowledge discovery in scalability, robustness, and efficiency. Despite their different characteristics, both kinds of data can be viewed as sequences of ordered observations and mined efficiently with algorithms based on block-wise decomposition methods.

I present models and mining algorithms for inferring the genetic variation structure in genome-wide SNP arrays. Genome-wide SNP arrays provide a comprehensive view of genome variation and serve as powerful resources for genetic and biomedical studies. Understanding the patterns of genetic variation in a population of individuals plays an important role in solving many genetics problems, such as genealogy reconstruction and gene association studies. In this thesis, I propose data mining models and algorithms to efficiently infer genetic variation structure from massive SNP panels of recombinant sequences resulting from meiotic recombination. I introduce the Minimum Segmentation Problem (MSP) to infer the segmentation structure of a single recombinant strain, and the Minimum Mosaic Problem (MMP) to infer the mosaic structure of a panel of recombinant strains. Both MSP and MMP estimate the ancestral polymorphism patterns exhibited in recombinant strains, which provide important inputs for subsequent association analysis. I propose efficient dynamic programming and graph algorithms based on block-wise decomposition that can solve MSP and MMP on large-scale, genome-wide panels.

I also present efficient algorithms for mining massive streaming and sensor network data for observational sciences such as ecology and environmental studies. These algorithms use block-wise synopsis construction to capture the data distribution online for the dynamic sequence data collected in sensor network and streaming systems. They support clustering analysis and order-statistics computation, which are critical for real-time monitoring, anomaly detection, and other domain-specific analyses.
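To make the idea of segmenting a single recombinant strain more concrete, here is a minimal dynamic programming sketch. It is not the thesis's MSP formulation, which the abstract does not spell out. It assumes a simplified, noise-free setting in which the founder (ancestral) sequences are known, all sequences have the same length, and the goal is to cover the recombinant with the fewest contiguous segments, each copied from one founder at the same positions. The function name `min_segments` and the string encoding of alleles are hypothetical choices for illustration.

```python
from math import inf


def min_segments(recombinant, founders):
    """Fewest contiguous segments that explain `recombinant` as a mosaic
    of `founders` (simplified, noise-free assumption, not the thesis's
    exact MSP). Returns None when some position matches no founder."""
    n = len(recombinant)
    if n == 0:
        return 0
    if any(len(f) != n for f in founders):
        raise ValueError("all founders must have the same length as the recombinant")

    # cost[k]: fewest segments for the prefix seen so far, with the
    # current segment copied from founder k (inf if not possible).
    cost = [inf] * len(founders)
    for i, allele in enumerate(recombinant):
        # Best cost of the previous prefix over all founders; an empty
        # prefix costs nothing.
        best_prev = 0 if i == 0 else min(cost, default=inf)
        new_cost = []
        for k, founder in enumerate(founders):
            if founder[i] != allele:
                new_cost.append(inf)  # founder k cannot explain position i
            else:
                stay = cost[k] if i > 0 else inf  # extend founder k's segment
                switch = best_prev + 1            # start a new segment here
                new_cost.append(min(stay, switch))
        cost = new_cost

    best = min(cost, default=inf)
    return None if best == inf else best


if __name__ == "__main__":
    founders = ["0000", "1111"]
    print(min_segments("0011", founders))
```

The sketch runs in time proportional to the sequence length times the number of founders and keeps only one cost vector in memory. It does not model genotyping noise or unknown founders, both of which the abstract identifies as part of the real problem.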

Efficient Algorithms for Mining Maximal Frequent Concatenate Sequences in Biological Datasets

A Fast Exact Pattern Matching Algorithm for Biological Sequences

A Fast Improved Pattern Matching Algorithm for Biological Sequences

Mining Sequential Patterns by Pattern-Growth: the PrefixSpan Approach

Efficient Algorithms for Finding a Longest Common Increasing Subsequence

Accelerated Frequent Closed Sequential Pattern Mining for Uncertain Data

Efficient Pattern-Growth Methods for Frequent Tree Pattern Mining

CuMen: Clustering Sequences Based on Maximal Frequent Sequential Pattern and its Application in Genome Sequence Assembly

Efficient Discovery of Periodic Patterns over Event Sequences

Efficient Mining of Gap-Constrained Subsequences and Its Various Applications

HANP-Miner: High average utility nonoverlapping sequential pattern mining

An Efficient Algorithm for Mining Frequent Sequence with Constraint Programming

Mining Probabilistically Frequent Sequential Patterns in Large Uncertain Databases

Mining emerging massive scientific sequence data using block-wise decomposition methods

Discovering Local Patterns From Multiple Temporal Sequences

Mining long sequential patterns in a noisy environment

Efficient Mining of Frequent Sequence Generators

NetNMSP: Nonoverlapping maximal sequential pattern mining

Discovering Periodic Patterns Common to Multiple Sequences

BIDE: Efficient Mining of Frequent Closed Sequences

A Fast Longest Common Subsequence Algorithm for Biosequences Alignment