Abstract: Deciphering the information hidden in gene expression assays to identify disease subtypes is of significant importance in precision medicine. However, the intricacy of biological networks and the curse of dimensionality of gene expression data impose computational limitations on this process. Clustering is therefore often the first choice of exploratory data analysis for identifying natural structures and intrinsic patterns in such data. Yet the sparse and high-dimensional nature of omics data prevents conventional clustering algorithms from discovering subtypes that are clinically relevant and statistically significant. Hence, coupling non-linear dimensionality reduction with clustering often becomes imperative to improve the clustering results. In this study, we present a robust pipeline to discover disease subtypes with clinical relevance. Specifically, we focus on discovering patient sub-groups whose residual life patterns differ markedly from those of other sub-groups. This is significant because, by refining prognosis, subtyping can reduce uncertainty in estimating a patient's expected outcome. The methodology presented is based on robust correlation estimation, UMAP (a non-linear dimensionality reduction method) and mapper (a tool from topology). Notably, we suggest a method for improving the robustness of the correlation matrix of gene expression data in order to improve the clustering results. The performance of the model is evaluated by applying it to five cancer datasets obtained through TCGA, and comparisons are made with the state-of-the-art methods NEMO, RSC-OTRI and SNF with regard to the log-rank test and the Restricted Life Expectancy Difference. For example, in the GBM dataset, the minimum separation between any two discovered subtypes is 221 days, which is significantly higher than that of the other methodologies.
We also compared the results obtained without the robust correlation-based estimate and observed that robust correlation significantly improves the separability between survival curves. From the results, we infer that our methodology separates the survival curves of patient sub-groups better than the other methodologies, despite using single-omics profiles of patients rather than the multi-omics profiles used by SNF and NEMO. Pathway over-representation analysis is performed on the final clustering results to investigate the biological underpinnings characterising each subtype.
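
To make the survival-separation criterion concrete, the following is a minimal sketch of how a difference in restricted life expectancy between two patient sub-groups could be computed. It assumes the measure is read as the difference between the areas under the two Kaplan-Meier curves up to a horizon `tau`; the abstract does not give the exact definition used. The function names, the horizon and the toy data are hypothetical and are not taken from the study.

```python
import numpy as np


def kaplan_meier(times, events):
    """Return the distinct event times and the Kaplan-Meier survival
    probability just after each of them.

    times  : follow-up time per patient (e.g. days)
    events : 1 if the event (death) was observed, 0 if censored
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    event_times = np.unique(times[events])
    surv = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(times >= t)             # still under observation at t
        deaths = np.sum((times == t) & events)   # events occurring exactly at t
        s *= 1.0 - deaths / at_risk
        surv.append(s)
    return event_times, np.array(surv)


def restricted_mean_survival(times, events, tau):
    """Area under the Kaplan-Meier step curve between 0 and tau."""
    t, s = kaplan_meier(times, events)
    keep = t < tau
    t, s = t[keep], s[keep]
    grid = np.concatenate(([0.0], t, [tau]))     # interval boundaries
    heights = np.concatenate(([1.0], s))         # curve height on each interval
    return float(np.sum(np.diff(grid) * heights))


def restricted_life_expectancy_difference(times_a, events_a,
                                          times_b, events_b, tau):
    """Difference in restricted mean survival between groups A and B
    (assumed definition, see the lead-in)."""
    return (restricted_mean_survival(times_a, events_a, tau)
            - restricted_mean_survival(times_b, events_b, tau))


if __name__ == "__main__":
    # Invented toy sub-groups; all events observed, horizon of 400 days.
    group_a = ([100, 200, 300, 400], [1, 1, 1, 1])
    group_b = ([50, 100, 150, 200], [1, 1, 1, 1])
    diff = restricted_life_expectancy_difference(*group_a, *group_b, tau=400)
    print(f"Separation between sub-groups: {diff:.1f} days")
```

On this invented example the two sub-groups are separated by roughly 125 days. In a subtyping setting, such a difference would be computed for every pair of discovered clusters, and the smallest pairwise value would be reported as the minimum separation.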

Robust correlation estimation and UMAP assisted topological analysis of omics data for disease subtyping
