Forest Fire Clustering for Single-cell Sequencing with Iterative Label Propagation and Parallelized Monte Carlo Simulation

Zhanlin Chen, Jeremy Goldwasser, Philip Tuckman, Jason Liu, Jing Zhang, Mark Gerstein

DOI: https://doi.org/10.1038/s41467-022-31107-8

2022-05-26

Abstract: In the era of single-cell sequencing, there is a growing need to extract insights from data with clustering methods. Here, we introduce Forest Fire Clustering, an efficient and interpretable method for cell-type discovery from single-cell data. Forest Fire Clustering makes minimal prior assumptions and, different from current approaches, calculates a non-parametric posterior probability that each cell is assigned a cell-type label. These posterior distributions allow for the evaluation of a label confidence for each cell and enable the computation of "label entropies," highlighting transitions along developmental trajectories. Furthermore, we show that Forest Fire Clustering can make robust, inductive inferences in an online-learning context and can readily scale to millions of cells. Finally, we demonstrate that our method outperforms state-of-the-art clustering approaches on diverse benchmarks of simulated and experimental data. Overall, Forest Fire Clustering is a useful tool for rare cell type discovery in large-scale single-cell analysis.

Machine Learning

What problem does this paper attempt to address?

This paper addresses three key problems in single-cell sequencing data analysis:

1. **Discovery of rare cell types**: Single-cell sequencing data are usually sparse and noisy. Traditional clustering methods make strong assumptions about the data distribution and may therefore overlook rare but important cell types. A method that keeps prior assumptions to a minimum is needed to identify them.
2. **Quantification of label confidence**: Beyond assigning cells to groups, it is also necessary to quantify how confident each cell's assigned cell-type label is. This is crucial for validating the reliability of a computational analysis.
3. **Computational efficiency**: Single-cell RNA sequencing (scRNA-seq) data sets keep growing, and a single data set may contain millions of cells. A clustering algorithm therefore has to be scalable and efficient.

To solve these problems, the authors introduce a new clustering method, Forest Fire Clustering. The method has the following characteristics:

- **Minimal prior assumptions**: Forest Fire Clustering propagates labels by simulating the spread of a forest fire and depends only on a single "fire temperature" hyperparameter, which reduces assumptions about the shape and other features of the data.
- **Internal validation**: Monte Carlo simulation yields a posterior label distribution for each data point, from which a label confidence and an information entropy can be calculated. These provide internal validation of the clustering results.
- **Computational efficiency**: The algorithm runs quickly on large-scale data sets and supports online learning, so it can handle data that accumulate over time without reclustering the entire data set.

In conclusion, Forest Fire Clustering aims to provide an efficient, interpretable and reliable tool for discovering rare cell types in single-cell sequencing data and for quantifying the confidence of their labels.
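The internal-validation idea can be illustrated with a minimal sketch. It assumes the Monte Carlo trials have already been run and that labels are comparable across trials. The propagation step itself is not shown. The function names (`label_posterior`, `label_confidence`, `label_entropy`) and the toy data are hypothetical and are not taken from the paper or its software.

```python
import math
from collections import Counter


def label_posterior(mc_labels):
    """Estimate a per-cell label distribution from Monte Carlo trials.

    mc_labels: list of trials; each trial is a list holding one label per cell.
    Assumption: the same label value means the same cluster in every trial.
    Returns one dict per cell mapping label -> empirical frequency.
    """
    n_trials = len(mc_labels)
    n_cells = len(mc_labels[0])
    posteriors = []
    for cell in range(n_cells):
        counts = Counter(trial[cell] for trial in mc_labels)
        posteriors.append({label: c / n_trials for label, c in counts.items()})
    return posteriors


def label_confidence(posterior):
    """Confidence taken here as the frequency of the most common label."""
    return max(posterior.values())


def label_entropy(posterior):
    """Shannon entropy (in nats) of one cell's label distribution."""
    return -sum(p * math.log(p) for p in posterior.values() if p > 0)


# Toy data (hypothetical): 4 trials, 3 cells.
# Cell 0 always gets label 0, cell 2 always gets label 1,
# and cell 1 is split evenly between the two labels.
mc_labels = [
    [0, 0, 1],
    [0, 1, 1],
    [0, 0, 1],
    [0, 1, 1],
]

posteriors = label_posterior(mc_labels)
confidences = [label_confidence(p) for p in posteriors]
entropies = [label_entropy(p) for p in posteriors]
```

In this toy case the cell whose label flips between trials has the lowest confidence and the highest entropy, which is the kind of signal the summary describes as marking uncertain or transitional cells.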

Scart: Recognizing Cell Clusters and Constructing Trajectory from Single-Cell Epigenomic Data

Single-Cell Transcriptome Data Clustering via Multinomial Modeling and Adaptive Fuzzy K-Means Algorithm

Scalable nonparametric clustering with unified marker gene selection for single-cell RNA-seq data

A Cell Marker-Based Clustering Strategy (cmcluster) for Precise Cell Type Identification of scRNA-seq Data

Simultaneous Deep Generative Modelling and Clustering of Single-Cell Genomic Data

Parallel Clustering of Single Cell Transcriptomic Data with Split-Merge Sampling on Dirichlet Process Mixtures

Clustering single-cell RNA-seq data with a model-based deep learning approach

scASDC: Attention Enhanced Structural Deep Clustering for Single-cell RNA-seq Data

Discovery of optimal cell type classification marker genes from single cell RNA sequencing data

SSCC: a novel computational framework for rapid and accurate clustering large single cell RNA-seq data

scAIDE: clustering of large-scale single-cell RNA-seq data reveals putative and rare cell types

A Hybrid Deep Clustering Approach for Robust Cell Type Profiling Using Single-cell RNA-seq Data

BREM-SC: A Bayesian Random Effects Mixture Model for Joint Clustering Single Cell Multi-omics Data

DIMM-SC: a Dirichlet mixture model for clustering droplet-based single cell transcriptomic data

SiCloneFit: Bayesian inference of population structure, genotype, and phylogeny of tumor clones from single-cell genome sequencing data

Linear-time cluster ensembles of large-scale single-cell RNA-seq and multimodal data

Deep Learning for clustering single-cell RNA-seq Data

Advancing single-cell RNA-seq data analysis through the fusion of multi-layer perceptron and graph neural network

scDFC: A deep fusion clustering method for single-cell RNA-seq data

Clustering of single-cell multi-omics data with a multimodal deep learning method