Abstract: Deciphering the information hidden in gene expression assays to identify disease subtypes is of significant importance in precision medicine. However, computational limitations thwart this process because of the intricacy of biological networks and the curse of dimensionality of gene expression data. Clustering therefore often becomes the first choice of exploratory data analysis for identifying natural structures and intrinsic patterns in such data. However, the sparse and high-dimensional nature of omics data prevents conventional clustering algorithms from discovering subtypes that are clinically relevant and statistically significant. Hence, coupling non-linear dimensionality reduction with clustering often becomes imperative for improving the clustering results. In this study, we present a robust pipeline for discovering disease subtypes with clinical relevance. Specifically, we focus on discovering patient sub-groups whose residual life patterns differ remarkably from those of other sub-groups. This is significant because, by refining prognosis, subtyping can reduce uncertainty in approximating a patient's expected outcome. The methodology presented is based on robust correlation estimation, UMAP (a non-linear dimensionality reduction method), and mapper (a tool from topology). Notably, we suggest a method for improving the robustness of the correlation matrix of gene expression data in order to improve the clustering results. The performance of the model is evaluated by applying it to five cancer datasets obtained through TCGA, and comparisons are made with the state-of-the-art methods NEMO, RSC-OTRI and SNF with regard to the log-rank test and the Restricted Life Expectancy Difference. For example, in the GBM dataset, the minimum separation between any two discovered subtypes is 221 days, which is significantly higher than that obtained with the other methodologies. We also compared the results obtained without the robust correlation-based estimate and observed that robust correlation significantly improves the separability between survival curves. From the results we infer that our methodology separates the survival curves of patient sub-groups better than the other methodologies, despite using single omics profiles of patients where SNF and NEMO use multiple omics profiles. Pathway over-representation analysis is performed on the final clustering results to investigate the biological underpinnings characterising each subtype.
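
The abstract reports separation between subtype survival curves in days. One common way to express such a separation is the difference in restricted mean survival time between two groups, that is, the difference in area under their Kaplan-Meier curves up to a horizon tau. The sketch below illustrates that idea only; it assumes this is close to what "Restricted Life Expectancy Difference" measures, and it is not the authors' implementation. The function names, the horizon `tau`, and the toy data are hypothetical.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate as a list of (event_time, survival_probability).

    times  : follow-up time of each patient (e.g. in days)
    events : 1 if the event (death) was observed, 0 if the patient was censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed
    return curve


def restricted_mean_survival(times, events, tau):
    """Area under the Kaplan-Meier step function between 0 and tau."""
    area = 0.0
    prev_t, prev_s = 0.0, 1.0
    for t, s in kaplan_meier(times, events):
        if t >= tau:
            break
        area += prev_s * (t - prev_t)
        prev_t, prev_s = t, s
    # Last (possibly partial) step up to the horizon.
    area += prev_s * (tau - prev_t)
    return area


def restricted_difference(group_a, group_b, tau):
    """Difference in restricted mean survival between two (times, events) groups."""
    return (restricted_mean_survival(group_a[0], group_a[1], tau)
            - restricted_mean_survival(group_b[0], group_b[1], tau))


# Toy data (hypothetical): two small sub-groups, all events observed.
subtype_a = ([100, 200, 300, 400], [1, 1, 1, 1])
subtype_b = ([50, 100, 150, 200], [1, 1, 1, 1])
separation_days = restricted_difference(subtype_a, subtype_b, tau=400)
```

With no censoring and a horizon at or beyond the last follow-up time, the restricted mean reduces to the ordinary mean survival time, which makes the toy data easy to check by hand. In practice a survival analysis library would normally be used in place of this hand-rolled estimator.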

Outcome-Guided Disease Subtyping for High-Dimensional Omics Data

Outcome-guided Sparse K-means for Disease Subtype Discovery via Integrating Phenotypic Data with High-dimensional Transcriptomic Data

Outcome-guided Bayesian clustering for disease subtype discovery using high-dimensional transcriptomic data

Clinical outcome-guided deep temporal clustering for disease progression subtyping

Robust correlation estimation and UMAP assisted topological analysis of omics data for disease subtyping

Unraveling the hidden heterogeneities of breast cancer based on functional miRNA cluster

Subtype Dependent Biomarker Identification and Tumor Classification from Gene Expression Profiles

Multi-view singular value decomposition for disease subtyping and genetic associations

Subtype-Former: a deep learning approach for cancer subtype discovery with multi-omics data

Subtype Classification and Heterogeneous Prognosis Model Construction in Precision Medicine

A unified computational model for revealing and predicting subtle subtypes of cancers

Unravelling the hidden heterogeneities of diffuse large B-cell lymphoma based on coupled two-way clustering

Cancer Subtyping via Embedded Unsupervised Learning on Transcriptomics Data

Subtype-DCC: decoupled contrastive clustering method for cancer subtype identification based on multi-omics data

Robust clustering of noisy high-dimensional gene expression data for patients subtyping

Supervised Graph Clustering for Cancer Subtyping Based on Survival Analysis and Integration of Multi-Omic Tumor Data

Lung squamous cell carcinoma subtyping and feature identification based on-omics data analysis

Subclassification of lung adenocarcinoma through comprehensive multi-omics data to benefit survival outcomes

Subtype-WGME enables whole-genome-wide multi-omics cancer subtyping

Robust Analysis of Cancer Heterogeneity for High‐dimensional Data

Molecular Subtyping of Cancer Based on Distinguishing Co-Expression Modules and Machine Learning