Abstract: Deciphering the information hidden in gene expression assays to identify disease subtypes is of significant importance in precision medicine. However, the intricacy of biological networks and the curse of dimensionality in gene expression data impose computational limitations on this process. Clustering is therefore often the first choice for exploratory data analysis, as it identifies natural structures and intrinsic patterns in the data. Yet the sparse, high-dimensional nature of omics data prevents conventional clustering algorithms from discovering subtypes that are both clinically relevant and statistically significant. Non-linear dimensionality reduction coupled with clustering thus often becomes imperative for improving clustering results. In this study, we present a robust pipeline for discovering disease subtypes with clinical relevance. Specifically, we focus on discovering patient sub-groups whose residual life patterns differ markedly from those of other sub-groups. This is significant because, by refining prognosis, subtyping can reduce uncertainty in estimating a patient's expected outcome. The methodology presented is based on robust correlation estimation, UMAP (a non-linear dimensionality reduction method), and Mapper (a tool from topology). Notably, we suggest a method for improving the robustness of the correlation matrix of gene expression data in order to improve the clustering results. The performance of the model is evaluated on five cancer datasets obtained from TCGA, and comparisons are made with the state-of-the-art methods NEMO, RSC-OTRI and SNF with regard to the log-rank test and the Restricted Life Expectancy Difference. For example, in the GBM dataset, the minimum separation between any two discovered subtypes is 221 days, which is significantly higher than that achieved by the other methodologies. We also compared the results obtained without the robust correlation-based estimate and observed that robust correlation significantly improves the separability of the survival curves. From the results we infer that our methodology separates the survival curves of patient sub-groups better than the other methodologies, despite using a single omics profile per patient where SNF and NEMO use multiple omics profiles. Pathway over-representation analysis is performed on the final clustering results to investigate the biological underpinnings characterising each subtype.
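The abstract reports subtype separation in days but does not spell out how the Restricted Life Expectancy Difference is computed. The sketch below shows one common reading, offered as an assumption rather than the authors' implementation: the difference between the areas under two Kaplan-Meier survival curves up to a horizon `tau`. The function names, the argument layout (`times`, `events`) and the choice of `tau` are all hypothetical.

```python
import numpy as np


def kaplan_meier(times, events):
    """Kaplan-Meier estimate: distinct event times and survival just after each.

    times  : follow-up time per patient (e.g. days)
    events : 1/True if the event (death) was observed, 0/False if censored
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    event_times = np.unique(times[events])
    survival = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(times >= t)             # patients still under observation at t
        deaths = np.sum((times == t) & events)   # events occurring exactly at t
        s *= 1.0 - deaths / at_risk
        survival.append(s)
    return event_times, np.array(survival)


def restricted_mean_survival(times, events, tau):
    """Area under the Kaplan-Meier step curve from 0 to tau."""
    t, s = kaplan_meier(times, events)
    keep = t < tau
    t, s = t[keep], s[keep]
    grid = np.concatenate(([0.0], t, [tau]))     # step boundaries
    heights = np.concatenate(([1.0], s))         # curve height on each step
    return float(np.sum(np.diff(grid) * heights))


def restricted_life_expectancy_difference(times_a, events_a,
                                          times_b, events_b, tau):
    """Assumed definition: restricted mean survival of group A minus group B."""
    return (restricted_mean_survival(times_a, events_a, tau)
            - restricted_mean_survival(times_b, events_b, tau))
```

Under this reading, a figure such as the 221 days quoted for GBM would correspond to the smallest absolute value of this difference over all pairs of discovered subtypes, with `tau` chosen from the follow-up range of the data.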

ESTIMATING THE NUMBER OF CANCER SUBTYPES FROM WHOLE-GENOME EXPRESSION DATA VIA A PENALIZED PROBABILISTIC PRINCIPAL COMPONENT ANALYSIS

A Robust Statistical Procedure to Discover Expression Biomarkers Using Microarray Genomic Expression Data.

Multi-Omics Data Fusion via a Joint Kernel Learning Model for Cancer Subtype Discovery and Essential Gene Identification

Integrative Analysis of Prognosis Data on Multiple Cancer Subtypes using Penalization

Dissecting tumor transcriptional heterogeneity from single-cell RNA-seq data by generalized binary covariance decomposition

Cancer prediction with gene expression profiling and differential evolution

A unified computational model for revealing and predicting subtle subtypes of cancers

Automated Cancer Subtyping via Vector Quantization Mutual Information Maximization

Identifying cancer subtypes in glioblastoma by combining genomic, transcriptomic and epigenomic data

Cancer Classification Using Entropy Analysis in Fractional Fourier Domain of Gene Expression Profile

Sparse integrative clustering of multiple omics data sets

Dynamic Meta-data Network Sparse PCA for Cancer Subtype Biomarker Screening

Unravelling the hidden heterogeneities of diffuse large B-cell lymphoma based on coupled two-way clustering

Exploring Dimension Learning Via a Penalized Probabilistic Principal Component Analysis

Deep-Learning-Based Cancer Profiles Classification Using Gene Expression Data Profile

CAsubtype: an R Package to Identify Gene Sets Predictive of Cancer Subtypes and Clinical Outcomes

Feature (gene) Selection in Gene Expression-Based Tumor Classification

Robust correlation estimation and UMAP assisted topological analysis of omics data for disease subtyping

Integrating Biological Knowledge with Gene Expression Profiles for Survival Prediction of Cancer

GENE EXPRESSION DATA ANALYSIS IN SUBTYPES OF OVARIAN CANCER USING COVARIANCE ANALYSIS

PINCAGE: probabilistic integration of cancer genomics data for perturbed gene identification and sample classification