A Parametric Model for Clustering Single-Cell Mutation Data

Jiaqian Yan, Jianing Xi, Zhenhua Yu
DOI: https://doi.org/10.1109/bibm55620.2022.9995308
2022-01-01
Abstract: Clustering tumor single-cell mutation data has become an important paradigm for deciphering tumor subclones and evolutionary history. Such data are often heavily complicated by missing entries, false positive errors and false negative errors. Although several computational methods have been developed for clustering binary mutation data, their accuracy still degrades on large datasets and on datasets with high sparsity, so more effective methods are needed. Here, we propose a novel method called CBM for reliably Clustering Binary Mutation data. CBM models the binary mutation data under a probabilistic framework parameterized by the false positive rate, the false negative rate, the presence probabilities of subclones and the binary mutation profiles of those subclones. To cope with the difficulty of optimizing discrete parameters, Gibbs sampling for mixtures is employed to iteratively sample cell-to-cluster assignments and cluster centers from the posterior. Extensive evaluations on simulated and real datasets demonstrate that CBM outperforms state-of-the-art tools on several performance metrics, such as ARI for clustering and accuracy for genotyping. CBM can be integrated into pipelines for reconstructing tumor evolutionary trees: detecting subclones with CBM can serve as a pretext task for tumor subclonal tree inference, which will significantly improve the computational efficiency of phylogenetic analysis, especially on large datasets. CBM software is freely available at https://github.com/zhyu-lab/cbm.
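The sampling scheme described in the abstract can be illustrated with a minimal sketch. This is not the CBM implementation; it is a generic Gibbs sampler for a mixture of binary profiles observed through false positive and false negative errors, written under several assumptions that do not come from the paper: the function name `gibbs_cluster_sketch` is hypothetical, missing entries are encoded as -1, the error rates `alpha` (false positive) and `beta` (false negative) are held fixed rather than estimated, each genotype entry has a uniform prior, the mixing weights have a symmetric Dirichlet(1) prior, and the last sample is returned instead of a posterior summary.

```python
import numpy as np

def gibbs_cluster_sketch(D, K, alpha=0.01, beta=0.2, n_iter=200, seed=0):
    """Illustrative Gibbs sampler (hypothetical helper, not the CBM code).

    D: (n_cells, n_mutations) array, 1 = mutation observed,
       0 = not observed, -1 = missing (assumed encoding).
    Returns the last sampled assignments z (n_cells,) and
    binary cluster centers G (K, n_mutations).
    """
    rng = np.random.default_rng(seed)
    n, m = D.shape
    obs1 = (D == 1).astype(float)   # indicator of observed 1s
    obs0 = (D == 0).astype(float)   # indicator of observed 0s; missing is in neither

    # Log error model, indexed by the true genotype g in {0, 1}.
    log_p1 = np.log(np.array([alpha, 1.0 - beta]))   # log P(D=1 | G=g)
    log_p0 = np.log(np.array([1.0 - alpha, beta]))   # log P(D=0 | G=g)

    G = rng.integers(2, size=(K, m))   # random initial cluster centers
    pi = np.full(K, 1.0 / K)           # initial mixing weights
    z = np.zeros(n, dtype=int)

    for _ in range(n_iter):
        # 1. Sample cell-to-cluster assignments given centers and weights.
        loglik = obs1 @ log_p1[G].T + obs0 @ log_p0[G].T   # shape (n, K)
        logpost = loglik + np.log(pi)
        logpost -= logpost.max(axis=1, keepdims=True)      # numerical stability
        prob = np.exp(logpost)
        prob /= prob.sum(axis=1, keepdims=True)
        z = np.array([rng.choice(K, p=p) for p in prob])

        # 2. Sample each center entry given the cells assigned to its cluster.
        onehot = np.eye(K)[z]        # (n, K) assignment indicators
        n1 = onehot.T @ obs1         # observed 1s per cluster and mutation
        n0 = onehot.T @ obs0         # observed 0s per cluster and mutation
        ll_g1 = n1 * log_p1[1] + n0 * log_p0[1]
        ll_g0 = n1 * log_p1[0] + n0 * log_p0[0]
        # Posterior P(G=1) under a uniform prior; empty clusters get 0.5.
        p_g1 = 1.0 / (1.0 + np.exp(np.clip(ll_g0 - ll_g1, -500, 500)))
        G = (rng.random((K, m)) < p_g1).astype(int)

        # 3. Sample mixing weights from their Dirichlet posterior.
        pi = rng.dirichlet(1.0 + onehot.sum(axis=0))

    return z, G
```

Calling `gibbs_cluster_sketch(D, K=2)` on a cells-by-mutations matrix returns one cluster label per cell and one binary profile per cluster. Cluster labels are arbitrary, so comparisons against known labels should use a permutation-invariant measure such as ARI.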