GMM-Demux: sample demultiplexing, multiplet detection, experiment planning, and novel cell-type verification in single cell sequencing

Hongyi Xin, Qiuyu Lian, Yale Jiang, Jiadi Luo, Xinjun Wang, Carla Erb, Zhongli Xu, Xiaoyi Zhang, Elisa Heidrich-O’Hare, Qi Yan, Richard H. Duerr, Kong Chen, Wei Chen
DOI: https://doi.org/10.1186/s13059-020-02084-2
IF: 17.906
2020-07-30
Genome Biology
Abstract: Identifying and removing multiplets are essential to improving the scalability and the reliability of single cell RNA sequencing (scRNA-seq). Multiplets create artificial cell types in the dataset. We propose a Gaussian mixture model-based multiplet identification method, GMM-Demux. GMM-Demux accurately identifies and removes multiplets through sample barcoding, including cell hashing and MULTI-seq. GMM-Demux uses a droplet formation model to authenticate putative cell types discovered from a scRNA-seq dataset. We generate two in-house cell-hashing datasets and compare GMM-Demux against three state-of-the-art sample barcoding classifiers. We show that GMM-Demux is stable and highly accurate and recognizes 9 multiplet-induced fake cell types in a PBMC dataset.
genetics & heredity,biotechnology & applied microbiology
What problem does this paper attempt to address?
This paper addresses the identification and removal of multiplets in single-cell RNA sequencing (scRNA-seq). The authors propose GMM-Demux, a multiplet identification method based on a Gaussian mixture model. Specifically, the paper targets three problems:

1. **Identify and remove multiplets**: Multiplets create false cell types in the dataset, which undermines the reliability and scalability of single-cell RNA sequencing. GMM-Demux identifies and removes these multiplets through sample barcoding techniques such as cell hashing and MULTI-seq.
2. **Predict multiplets during experiment planning**: Before a single-cell sequencing experiment is run, predicting the incidence of multiplets is crucial for experimental design. GMM-Demux can estimate the proportions of multiplets, same-sample multiplets (SSMs), and singlets in future experiments.
3. **Verify the authenticity of newly discovered cell types**: Multiplets may be misidentified as new rare cell types. GMM-Demux uses a droplet formation model to check whether a putative new cell type discovered in a single-cell sequencing dataset consists of real pure-type GEMs (gel beads-in-emulsion, i.e. droplets) rather than phony-type GEMs.

### Specific implementation methods

GMM-Demux achieves the above goals through the following steps:

- **Classification based on the Gaussian mixture model**: GMM-Demux fits the HTO (hashtag oligonucleotide) UMI counts of each sample independently to a Gaussian mixture model and calculates the posterior probability that each GEM contains cells from the corresponding sample.
- **Calculate the probabilities of MSM and SSD**: Based on these posterior probabilities, GMM-Demux calculates the probability that each GEM is a multi-sample multiplet (MSM) or a same-sample droplet (SSD).
- **Estimate the proportions of SSMs and singlets**: Among the SSDs, GMM-Demux uses an enhanced binomial probability model to estimate the proportions of SSMs and singlets.
- **Verify the validity of new cell types**: GMM-Demux checks whether the GEM cluster that defines a putative new cell type is a pure-type GEM cluster or a phony-type GEM cluster, and thereby verifies whether the new cell type is genuine.

### Experimental verification

To evaluate GMM-Demux, the authors ran multiple experiments on two in-house cell hashing and CITE-seq datasets and one publicly available cell hashing dataset. In addition, nine simulated datasets were generated, covering different numbers of samples, MSM percentages, and degrees of sample imbalance. The reported results show that GMM-Demux is consistent and accurate across all tests and outperforms the other existing classifiers it was compared against.

In conclusion, the paper addresses the identification and removal of multiplets in single-cell RNA sequencing by proposing GMM-Demux, improving the reliability and scalability of single-cell sequencing data.
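
The first two steps listed under "Specific implementation methods" can be illustrated with a small sketch. This is not the GMM-Demux implementation: it assumes log-scale HTO values (the normalization is a simplification chosen here), fits a two-component Gaussian mixture per sample with a hand-written EM loop, and treats the per-sample posteriors as independent when combining them into negative, SSD, and MSM probabilities. All function names and the simulated numbers are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def posterior_high(values, n_iter=50):
    """Fit a 1-D two-component Gaussian mixture by EM and return, for each
    value, the posterior probability of the higher-mean component."""
    n = len(values)
    ordered = sorted(values)
    means = [ordered[n // 4], ordered[(3 * n) // 4]]  # quartile initialisation
    grand_mean = sum(values) / n
    spread = math.sqrt(sum((v - grand_mean) ** 2 for v in values) / n) or 1.0
    stds = [spread, spread]
    weights = [0.5, 0.5]

    def e_step():
        post = []
        for v in values:
            low = weights[0] * normal_pdf(v, means[0], stds[0])
            high = weights[1] * normal_pdf(v, means[1], stds[1])
            total = low + high
            post.append(high / total if total > 0.0 else 0.5)
        return post

    for _ in range(n_iter):
        post = e_step()
        for k in (0, 1):
            resp = post if k == 1 else [1.0 - p for p in post]
            mass = sum(resp)
            if mass < 1e-9:
                continue  # leave an empty component unchanged
            weights[k] = mass / n
            means[k] = sum(r * v for r, v in zip(resp, values)) / mass
            var = sum(r * (v - means[k]) ** 2 for r, v in zip(resp, values)) / mass
            stds[k] = max(math.sqrt(var), 1e-3)

    post = e_step()
    if means[0] > means[1]:  # make sure "high" refers to the larger mean
        post = [1.0 - p for p in post]
    return post


def gem_probabilities(posteriors):
    """Combine one GEM's per-sample posteriors into
    (P(negative), P(SSD), P(MSM)), assuming samples are independent."""
    p_negative = math.prod(1.0 - p for p in posteriors)
    p_ssd = 0.0
    for i, p in enumerate(posteriors):
        others = math.prod(1.0 - q for j, q in enumerate(posteriors) if j != i)
        p_ssd += p * others  # exactly one sample present
    p_msm = max(0.0, 1.0 - p_negative - p_ssd)
    return p_negative, p_ssd, p_msm


def simulate(seed=0, n_single=300, n_msm=60):
    """Synthetic log-scale HTO values for two samples (made-up parameters)."""
    rng = random.Random(seed)
    truth, hto_a, hto_b = [], [], []
    for label, n in (("A", n_single), ("B", n_single), ("MSM", n_msm)):
        for _ in range(n):
            truth.append(label)
            hto_a.append(rng.gauss(6.0 if label in ("A", "MSM") else 2.0, 0.5))
            hto_b.append(rng.gauss(6.0 if label in ("B", "MSM") else 2.0, 0.5))
    return truth, hto_a, hto_b


def classify(hto_columns):
    """Label each GEM with the most probable of negative / SSD / MSM."""
    per_sample = [posterior_high(col) for col in hto_columns]
    labels = []
    for gem in zip(*per_sample):
        probs = gem_probabilities(gem)
        labels.append(("negative", "SSD", "MSM")[probs.index(max(probs))])
    return labels


truth, hto_a, hto_b = simulate()
labels = classify([hto_a, hto_b])
print({name: labels.count(name) for name in ("negative", "SSD", "MSM")})
```

Fitting one mixture per HTO, rather than one joint model over all samples, mirrors the "independently fits" wording of the summary and lets each sample have its own background and signal levels. The SSD class produced here still mixes singlets and SSMs. Separating those two requires the binomial model described in the third step, which this sketch does not attempt.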