Forest Fire Clustering for Single-cell Sequencing with Iterative Label Propagation and Parallelized Monte Carlo Simulation

Zhanlin Chen, Jeremy Goldwasser, Philip Tuckman, Jason Liu, Jing Zhang, Mark Gerstein
DOI: https://doi.org/10.1038/s41467-022-31107-8
2022-05-26
Abstract: In the era of single-cell sequencing, there is a growing need to extract insights from data with clustering methods. Here, we introduce Forest Fire Clustering, an efficient and interpretable method for cell-type discovery from single-cell data. Forest Fire Clustering makes minimal prior assumptions and, different from current approaches, calculates a non-parametric posterior probability that each cell is assigned a cell-type label. These posterior distributions allow for the evaluation of a label confidence for each cell and enable the computation of "label entropies," highlighting transitions along developmental trajectories. Furthermore, we show that Forest Fire Clustering can make robust, inductive inferences in an online-learning context and can readily scale to millions of cells. Finally, we demonstrate that our method outperforms state-of-the-art clustering approaches on diverse benchmarks of simulated and experimental data. Overall, Forest Fire Clustering is a useful tool for rare cell type discovery in large-scale single-cell analysis.
Machine Learning
What problem does this paper attempt to address?
This paper addresses three key problems in single-cell sequencing data analysis:

1. **Discovery of rare cell types**: Single-cell sequencing data are usually very sparse and uncertain. Traditional clustering methods make strong assumptions about the data distribution and may therefore overlook rare but important cell types. A method that minimizes prior assumptions is needed to identify them.
2. **Quantification of label confidence**: Beyond assigning cells to groups, it is also necessary to quantify how confident each cell's assigned type is. This is crucial for verifying the validity and reliability of a computational analysis.
3. **Computational efficiency**: Single-cell RNA sequencing (scRNA-seq) data sets keep growing, and a single data set may contain millions of cells. A scalable and efficient clustering algorithm is therefore essential.

To solve these problems, the authors introduce a new clustering method, Forest Fire Clustering. It has the following characteristics:

- **Minimal prior assumptions**: Forest Fire Clustering propagates labels by simulating the spread of a forest fire and depends only on a "fire temperature" hyperparameter, which reduces assumptions about the shape of the data and other features.
- **Internal validation**: Monte Carlo simulation yields a posterior label distribution for each data point, from which label confidence and information entropy can be calculated. These provide an internal validation of the clustering results.
- **Computational efficiency**: The algorithm runs quickly on large-scale data sets and supports online learning, so it can handle data streams that accumulate over time without reclustering the entire data set.
In conclusion, Forest Fire Clustering aims to provide an efficient, interpretable and reliable tool for discovering rare cell types from single-cell sequencing data and quantifying their label confidence.
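The internal-validation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `summarize_label_samples`, the input layout (one row of labels per Monte Carlo trial), and the assumption that labels are already aligned across trials and that every cell receives a label in every trial are all hypothetical choices made for illustration. Given such samples, the sketch estimates each cell's posterior label distribution by counting, then derives a label confidence (the largest posterior probability) and a label entropy (Shannon entropy in nats).

```python
import numpy as np


def summarize_label_samples(samples, n_labels):
    """Summarize Monte Carlo label samples per cell.

    samples: (n_trials, n_cells) integer array; samples[t, i] is the label
             that cell i received in trial t (hypothetical input layout,
             labels assumed aligned across trials).
    Returns (posterior, confidence, entropy).
    """
    samples = np.asarray(samples)
    n_trials, n_cells = samples.shape

    # Empirical posterior: fraction of trials in which each cell got each label.
    counts = np.zeros((n_cells, n_labels))
    for k in range(n_labels):
        counts[:, k] = (samples == k).sum(axis=0)
    posterior = counts / n_trials

    # Label confidence: probability of the most frequent label.
    confidence = posterior.max(axis=1)

    # Label entropy: -sum p log p, treating 0 * log(0) as 0.
    with np.errstate(divide="ignore"):
        log_p = np.where(posterior > 0, np.log(posterior), 0.0)
    entropy = -(posterior * log_p).sum(axis=1)

    return posterior, confidence, entropy


# Toy data: 4 trials, 3 cells, 2 labels. Cell 1 flips between labels.
toy_samples = [
    [0, 0, 1],
    [0, 1, 1],
    [0, 0, 1],
    [0, 1, 1],
]
posterior, confidence, entropy = summarize_label_samples(toy_samples, n_labels=2)
```

In this toy input, the cell whose label flips between trials has the lowest confidence and the highest entropy, which is the kind of signal the summary associates with cells in transition between types.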