Abstract: Genomic malformations are believed to be the driving factors of many diseases. Therefore, understanding the intrinsic mechanisms underlying the genome and informing clinical practices have become two important missions of large-scale genomic research. Recently, high-throughput molecular data have provided abundant information about the whole genome and have popularized computational tools in genomics. However, traditional machine learning methodologies often suffer from strong limitations when dealing with high-throughput genomic data, because such data are usually very high-dimensional, highly heterogeneous, and can show complicated nonlinear effects. In this thesis, we present five new algorithms or models to address these challenges, each of which is applied to a specific genomic problem.

Project 1 focuses on model selection in cancer diagnosis. We develop an efficient algorithm (ADMM-ENSVM) for the Elastic Net Support Vector Machine, which achieves simultaneous variable selection and max-margin classification. On a colon cancer diagnosis dataset, ADMM-ENSVM shows advantages over other SVM algorithms in terms of diagnostic accuracy, feature selection ability, and computational efficiency.

Project 2 focuses on model selection in gene correlation analysis. We develop an efficient algorithm (SBLVGG) for the Latent Variable Gaussian Graphical Model (LVGG), using a methodology similar to that of ADMM-ENSVM. LVGG models the marginal concentration matrix of the observed variables as a combination of a sparse matrix and a low-rank one. Evaluated on a microarray dataset containing 6,316 genes, SBLVGG is notably faster than the state-of-the-art LVGG solver, and shows that most of the correlation among genes can be effectively explained by only tens of latent factors.

Project 3 focuses on ensemble learning in cancer survival analysis. We develop a gradient boosting model (GBMCI), which does not explicitly assume a particular form of the hazard function, but instead trains an ensemble of regression trees to approximately optimize the concordance index. We benchmark the performance of GBMCI against several popular survival models on a large-scale breast cancer prognosis dataset. GBMCI consistently outperforms the other methods across a number of feature representations, which are heterogeneous and contain missing values.

Project 4 focuses on deep learning in gene expression inference (GEIDN). GEIDN is a large-scale neural network that can infer ~21k target genes jointly from ~1k landmark genes and can naturally capture hierarchical nonlinear interactions among genes. We deploy deep learning techniques (dropout, momentum training, GPU computing, etc.) to train GEIDN. On a dataset of ~129k complete human transcriptomes, GEIDN outperforms both k-nearest neighbor regression and linear regression in predicting >99.96% of the target genes. Moreover, increasing the network scale helps to improve GEIDN, and increasing the amount of training data benefits GEIDN more than the other methods.

Project 5 focuses on deep learning in annotating coding and noncoding genetic variants (DANN). DANN is a neural network that differentiates evolutionarily derived alleles from simulated ones using 949 highly heterogeneous features, and it can capture nonlinear relationships among these features. We train DANN with the same deep learning techniques used for GEIDN. DANN achieves an 18.90% relative reduction in the error rate and a 14.52% relative increase in the area under the curve over CADD, a state-of-the-art algorithm based on a linear SVM for annotating genetic variants.
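
To make concrete the concordance index that GBMCI is described as approximately optimizing, the following is a minimal sketch of the common Harrell-style definition. It is not the thesis's implementation: the function name, the argument names, the tie handling (0.5 credit for tied predictions), and the toy data are all assumptions made for illustration, and the thesis may use a different variant.

```python
from itertools import permutations


def concordance_index(times, events, predicted):
    """Harrell-style concordance index (illustrative sketch, not GBMCI itself).

    times     : observed follow-up times
    events    : 1 if the event was observed, 0 if the subject was censored
    predicted : predicted survival scores; larger means longer predicted survival
    """
    concordant = 0.0
    comparable = 0
    for i, j in permutations(range(len(times)), 2):
        # A pair is comparable only if subject i had an observed event
        # strictly before subject j's observed time.
        if events[i] == 1 and times[i] < times[j]:
            comparable += 1
            if predicted[i] < predicted[j]:
                concordant += 1.0   # predicted order agrees with observed order
            elif predicted[i] == predicted[j]:
                concordant += 0.5   # assumed convention: ties get half credit
    if comparable == 0:
        raise ValueError("no comparable pairs in the data")
    return concordant / comparable


# Hypothetical toy data: four subjects, the second one censored.
times = [2.0, 4.0, 6.0, 8.0]
events = [1, 0, 1, 1]

print(concordance_index(times, events, [1.0, 2.0, 3.0, 4.0]))  # correct ordering
print(concordance_index(times, events, [4.0, 3.0, 2.0, 1.0]))  # reversed ordering
print(concordance_index(times, events, [1.0, 1.0, 1.0, 1.0]))  # uninformative
```

Under this definition the index depends only on the ordering of the predictions, so it is piecewise constant in the model outputs and cannot be optimized directly by gradient methods, which is presumably why the abstract speaks of optimizing it approximately.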

Comparative Study of Ensemble Learning Approaches in the Identification of Disease Mutations

Inferring Non-Synonymous Single-Nucleotide Polymorphisms-Disease Associations Via Integration of Multiple Similarity Networks

Prioritisation of Candidate Single Amino Acid Polymorphisms Using One-Class Learning Machines.

Prioritization of Nonsynonymous Single Nucleotide Variants for Exome Sequencing Studies Via Integrative Learning on Multiple Genomic Data

Supervised Learning-Based tagSNP Selection for Genome-Wide Disease Classifications

Identification of Disease-Related nsSNPs Via the Integration of Protein Sequence Features and Domain-Domain Interaction Data.

Integrating sequence conservation features and a domain-domain interaction network to detect disease-associated nsSNPs

Sequence-Based Prioritization of Nonsynonymous Single-Nucleotide Polymorphisms for the Study of Disease Mutations

A Comparative Study of Ensemble Learning Approaches in the Classification of Breast Cancer Metastasis

Prediction of Deleterious Nonsynonymous Single-Nucleotide Polymorphism for Human Diseases

Nested genetic algorithm-based classifier selection and placement in multi-level ensemble framework for effective disease diagnosis

Implementation of ensemble machine learning algorithms on exome datasets for predicting early diagnosis of cancers

A comprehensive investigation of statistical and machine learning approaches for predicting complex human diseases on genomic variants

Prioritization of Candidate Nonsynonymous Single Nucleotide Polymorphisms via Sequence Conservation Features

Machine Learning for Large-Scale Genomics: Algorithms, Models and Applications

Prediction of Deleterious Single Amino Acid Polymorphisms with a Consensus Holdout Sampler

GCN-MF: Disease-Gene Association Identification By Graph Convolutional Networks and Matrix Factorization

Computational Approaches for Disease Gene Identification

Deleterious synonymous mutation identification based on selective ensemble strategy

Classification of genetic variants using machine learning