Abstract:

Motivation: One of the most important research areas in personalized medicine is the discovery of disease sub-types with relevance in clinical applications. This is usually accomplished by exploring gene expression data with unsupervised clustering methodologies. With the advent of multiple omics technologies, data integration methodologies have been developed to improve patient separability. However, these methods do not guarantee that patients in different clusters are separable in terms of survival.

Results: We propose a new methodology that first computes a robust and sparse correlation matrix of the genes, then decomposes it and projects the patient data onto the first m spectral components of the correlation matrix. A robust, noise-adaptive clustering algorithm is then applied. The clustering is set up to optimize the separation between survival curves estimated cluster-wise. The method is able to identify clusters that have different omics signatures and also statistically significant differences in survival time. The proposed methodology is tested on five cancer datasets downloaded from The Cancer Genome Atlas repository. It is compared with the Similarity Network Fusion (SNF) approach and with model-based clustering based on Student's t-distribution (TMIX). Our method achieves better survival separability even though it uses a single gene expression view, whereas SNF takes a multi-view approach. Finally, a pathway-based analysis is performed to highlight the biological processes that differentiate the obtained patient groups.

Availability and implementation: Our R source code is available online at https://github.com/angy89/RobustClusteringPatientSubtyping.

Supplementary information: Supplementary data are available at Bioinformatics online.
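The projection step described in the Results can be illustrated with a minimal Python sketch. This is not the authors' R implementation: the function name `spectral_projection`, the use of Spearman rank correlation as a simple robust stand-in, the hard threshold of 0.3 used to sparsify the matrix, and the simulated data are all assumptions made for illustration. The robust clustering and the survival-based optimization are not shown.

```python
import numpy as np

def spectral_projection(X, m, threshold=0.3):
    """Project patients (rows of X) onto the top-m spectral components
    of a sparsified gene-gene correlation matrix.

    X is a patients x genes array. The rank-based correlation and the
    hard threshold are illustrative assumptions, not the paper's estimator.
    """
    # Standardize each gene so the projection is on a common scale.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)

    # Rank-transform each gene; Pearson correlation of the ranks is the
    # Spearman correlation (assuming no ties), which is less outlier-sensitive.
    ranks = X.argsort(axis=0).argsort(axis=0).astype(float)
    R = np.corrcoef(ranks, rowvar=False)

    # Sparsify by zeroing weak correlations (hypothetical cutoff).
    # Note: hard thresholding may break positive semi-definiteness.
    R_sparse = np.where(np.abs(R) >= threshold, R, 0.0)

    # Eigendecomposition of the symmetric matrix; eigh returns
    # eigenvalues in ascending order, so pick the m largest.
    eigvals, eigvecs = np.linalg.eigh(R_sparse)
    top = np.argsort(eigvals)[::-1][:m]

    # Patient scores on the first m spectral components.
    return Z @ eigvecs[:, top]

# Simulated expression matrix: 40 patients, 12 genes (made-up data).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 12))
scores = spectral_projection(X, m=3)
```

The resulting `scores` array has one row per patient and m columns, and would be the input to whatever clustering algorithm is applied next.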

Robust model-based clustering with gene ranking

Robust clustering of noisy high-dimensional gene expression data for patients subtyping

Robust Bayesian clustering for replicated gene expression data.

Clustering cancer gene expression data: a comparative study

Analysis of a Gibbs sampler method for model based clustering of gene expression data

A Kernel-Based Clustering Method for Gene Selection with Gene Expression Data.

Gamma-based clustering via ordered means with application to gene-expression analysis

Bayesian clustering of replicated time-course gene expression data with weak signals

Nonparametric clustering of RNA-sequencing data

Gen-Cluster: an Efficient Gene Expression Data High Dimensional Clustering Algorithm

A Bayesian hierarchical hidden Markov model for clustering and gene selection: Application to kidney cancer gene expression data

An Analysis of Gene Expression Data using Penalized Fuzzy C-Means Approach

Rough-fuzzy clustering for grouping functionally similar genes from microarray data

Modelling and Clustering of Gene Expressions Using RBFs and a Shape Similarity Metric

Tendency based Subspace Clustering on Gene Expression Data

On Gene Selection and Classification for Cancer Microarray Data Using Multi-Step Clustering and Sparse Representation

Modeling and Analysis of Gene Expression Time-Series Based on Co-Expression.

A Novel Approach for Single Gene Selection Using Clustering and Dimensionality Reduction

Gene Selection for Cancer Clustering Analysis Based on Expression Data

A Comparative Study of Novel Robust Clustering Algorithms

A sparse negative binomial mixture model for clustering RNA-seq count data