Abstract: Deciphering the information hidden in gene expression assays to identify disease subtypes is of significant importance in precision medicine. However, computational limitations hinder this process owing to the intricacy of biological networks and the curse of dimensionality in gene expression data. Clustering is therefore often the first choice of exploratory data analysis for identifying natural structures and intrinsic patterns in such data. However, the sparse and high-dimensional nature of omics data prevents conventional clustering algorithms from discovering subtypes that are clinically relevant and statistically significant. Hence, coupling non-linear dimensionality reduction with clustering often becomes imperative to improve the clustering results. In this study, we present a robust pipeline to discover disease subtypes with clinical relevance. Specifically, we focus on discovering patient sub-groups whose residual life patterns are remarkably different from those of other sub-groups. This is significant because, by refining prognosis, subtyping can reduce uncertainty in approximating a patient's expected outcome. The methodology presented is based on robust correlation estimation, UMAP (a non-linear dimensionality reduction method) and mapper (a tool from topology). Notably, we suggest a method for improving the robustness of the correlation matrix of gene expression data in order to improve the clustering results. The performance of the model is evaluated by applying it to five cancer datasets obtained through TCGA, and comparisons are performed with the state-of-the-art methods NEMO, RSC-OTRI and SNF with regard to the log-rank test and the Restricted Life Expectancy Difference. For example, in the GBM dataset, the minimum separation between any two discovered subtypes is 221 days, which is significantly higher than that achieved by the other methodologies.
We also compared the results obtained without the robust correlation-based estimate and observed that robust correlation significantly improves the separability between survival curves. From the results, we infer that our methodology separates the survival curves of patient sub-groups better than the other methodologies, despite using single-omics profiles of patients rather than the multi-omics profiles used by SNF and NEMO. Pathway over-representation analysis is performed on the final clustering results to investigate the biological underpinnings characterising each subtype.
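To make the survival-separation criterion concrete, the following is a minimal sketch of how a difference in restricted life expectancy between two patient sub-groups could be computed. It assumes the quantity is the difference in area under the Kaplan-Meier curves up to a horizon `tau`; the abstract does not give the exact definition, so this is an illustrative assumption rather than the authors' implementation. The function names (`kaplan_meier`, `restricted_mean_survival`, `rmst_difference`) and the toy data are hypothetical.

```python
import numpy as np


def kaplan_meier(times, events):
    """Kaplan-Meier estimate: returns distinct event times and S(t) at each."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    event_times = np.unique(times[events])  # sorted times with >= 1 death
    survival = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(times >= t)             # still under observation at t
        deaths = np.sum((times == t) & events)   # deaths exactly at t
        s *= 1.0 - deaths / at_risk
        survival.append(s)
    return event_times, np.array(survival)


def restricted_mean_survival(times, events, tau):
    """Area under the Kaplan-Meier step curve on [0, tau]."""
    t, s = kaplan_meier(times, events)
    keep = t < tau                               # steps at or after tau add no area
    t, s = t[keep], s[keep]
    grid = np.concatenate(([0.0], t, [tau]))     # interval boundaries
    heights = np.concatenate(([1.0], s))         # S(t) on each interval
    return float(np.sum(heights * np.diff(grid)))


def rmst_difference(times_a, events_a, times_b, events_b, tau):
    """Restricted mean survival of group B minus that of group A (same units as times)."""
    return (restricted_mean_survival(times_b, events_b, tau)
            - restricted_mean_survival(times_a, events_a, tau))


# Toy data (hypothetical): survival times with event indicators (1 = death, 0 = censored).
group_a = ([1, 2, 3, 4], [1, 1, 1, 1])
group_b = ([2, 4, 6, 8], [1, 1, 1, 1])
diff = rmst_difference(*group_a, *group_b, tau=4.0)
```

In a subtyping pipeline, a quantity of this kind would be evaluated for every pair of discovered clusters, with the smallest pairwise value indicating how well separated the closest two survival curves are.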

Outcome-guided Bayesian clustering for disease subtype discovery using high-dimensional transcriptomic data

Outcome-guided Sparse K-means for Disease Subtype Discovery via Integrating Phenotypic Data with High-dimensional Transcriptomic Data

Outcome-Guided Disease Subtyping for High-Dimensional Omics Data

[Changes in milk fat consumption and infarction mortality during the 1970s in Finland].

Robust clustering of noisy high-dimensional gene expression data for patients subtyping

Robust correlation estimation and UMAP assisted topological analysis of omics data for disease subtyping

A Bayesian framework to study tumor subclone-specific expression by combining bulk DNA and single-cell RNA sequencing data

Identifying cancer subtypes in glioblastoma by combining genomic, transcriptomic and epigenomic data

Hierarchical Bayesian Clustering Design of Multiple Biomarker Subgroups (HCOMBS)

A clustering approach to integrative analyses of multiomic cancer data

A Clustering Approach to Integrative Analysis of Multiomic Cancer Data

Clustering of Transcriptomic Data for the Identification of Cancer Subtypes

Bayesian outcome-guided multi-view mixture models with applications in molecular precision medicine

Supervised clustering of high-dimensional data using regularized mixture modeling

Subtype-DCC: decoupled contrastive clustering method for cancer subtype identification based on multi-omics data

Gene-SGAN: discovering disease subtypes with imaging and genetic signatures via multi-view weakly-supervised deep clustering

Gene-SGAN: a method for discovering disease subtypes with imaging and genetic signatures via multi-view weakly-supervised deep clustering

Subtype-GAN: a deep learning approach for integrative cancer subtyping of multi-omics data

Bayesian network-driven clustering analysis with feature selection for high-dimensional multi-modal molecular data

A Bayesian hierarchical hidden Markov model for clustering and gene selection: Application to kidney cancer gene expression data

Model-based clustering for identifying disease-associated SNPs in case-control genome-wide association studies