Abstract: I present efficient data mining algorithms for knowledge discovery on two types of emerging large-scale, sequence-based scientific datasets: (1) static sequence data generated from Single-Nucleotide Polymorphism (SNP) diversity arrays for genomic studies, and (2) dynamic sequence data collected in streaming and sensor network systems for environmental studies. The massive, noisy nature of the SNP arrays and the distributed, online nature of sensor network data pose challenges for knowledge discovery in scalability, robustness, and efficiency. Despite their different characteristics, SNP arrays and streaming sensor data can both be viewed as sequences of ordered observations and mined efficiently using algorithms based on block-wise decomposition methods.

First, I present models and mining algorithms for inferring the genetic variation structure in genome-wide SNP arrays. Genome-wide SNP arrays provide a comprehensive view of genome variation and serve as powerful resources for genetic and biomedical studies. Understanding the patterns of genetic variation in a population of individuals plays an important role in solving many genetics problems, such as genealogy reconstruction and gene association studies. In this thesis, I propose data mining models and algorithms to efficiently infer genetic variation structure from massive SNP panels of recombinant sequences resulting from meiotic recombination. I introduce the Minimum Segmentation Problem (MSP) to infer the segmentation structure of a single recombinant strain, and the Minimum Mosaic Problem (MMP) to infer the mosaic structure of a panel of recombinant strains. Both MSP and MMP estimate the ancestral polymorphism patterns exhibited in recombinant strains, which provide important inputs for subsequent association analysis. I propose efficient dynamic programming and graph algorithms based on block-wise decomposition that solve MSP and MMP on large-scale, genome-wide panels.

Second, I present efficient algorithms for mining massive streaming and sensor network data for observational sciences such as ecology and environmental studies. These algorithms use block-wise synopsis construction to capture the data distribution online for the dynamic sequence data collected in sensor network and streaming systems. They support clustering analysis and order-statistics computation, which are critical for real-time monitoring, anomaly detection, and other domain-specific analyses.

A Scalable Data Mining Architecture for Bioinformation

A novel agent-based parallel ETL system for massive data

Mining emerging massive scientific sequence data using block-wise decomposition methods

Biomedr: An R/Cran Package For Integrated Data Analysis Pipeline In Biomedical Study

Big data analytics in bioinformatics: architectures, techniques, tools and issues

A hierarchical distributed data mining architecture

Intelligent mining of large-scale bio-data: Bioinformatics applications

Bio-medical Big Data Operating System (Bio-OS): An Integrated Data Mining Environment for Data Intensive Scientific Research

Survey of Biodata Analysis from a Data Mining Perspective

Data mining techniques for microarray datasets

A Distributed Parallel Computing Environment for Bioinformatics Problems

Deep Learning in Mining Biological Data

A Survey of Data Mining and Deep Learning in Bioinformatics

BioSeq-Analysis: a platform for DNA, RNA and protein sequence analysis based on machine learning approaches

Research and Implementation of Interactive Analysis and Mining Technology for Big Data

A Distributed Text Mining System for Online Web Textual Data Analysis

Bioinformatics software development: Principles and future directions

BioInformatics Agent (BIA): Unleashing the Power of Large Language Models to Reshape Bioinformatics Workflow

Closed-loop Big Data Analysis with Visualization and Scalable Computing

Biopipe: A Flexible Framework for Protocol-Based Bioinformatics Analysis

Big Data Analytics in Bioinformatics: A Machine Learning Perspective