Lower-dimensional projections of cellular expression improves cell type classification from single-cell RNA sequencing

Muhammad Umar, Muhammad Asif, Arif Mahmood
2024-10-14
Abstract: Single-cell RNA sequencing (scRNA-seq) enables the study of cellular diversity at the single-cell level. It provides a global view of cell-type specification during the onset of biological mechanisms such as developmental processes and human organogenesis. Various statistical, machine learning, and deep learning-based methods have been proposed for cell-type classification. Most of these methods utilize unsupervised lower-dimensional projections obtained from large reference data. In this work, we propose a reference-based method for cell type classification, called EnProCell. EnProCell first computes lower-dimensional projections that capture both high variance and class separability through an ensemble of principal component analysis and multiple discriminant analysis. In the second phase, EnProCell trains a deep neural network on the lower-dimensional representation of the data to classify cell types. The proposed method outperformed existing state-of-the-art methods when tested on four different datasets produced by different single-cell sequencing technologies. EnProCell showed higher accuracy (98.91) and F1 score (98.64) than other methods when predicting cell types within reference datasets (reference-to-reference prediction). Similarly, EnProCell performed better than existing methods in predicting cell types for data with unknown cell types (query) from reference datasets (accuracy: 99.52; F1 score: 99.07). In addition to improved performance, the proposed methodology is simple and does not require additional computational resources or time. EnProCell is available at https://github.com/umar1196/EnProCell.
Machine Learning, Artificial Intelligence, Genomics
What problem does this paper attempt to address?
The paper attempts to address the issue of accuracy in cell type classification within single-cell RNA sequencing (scRNA-seq) data. Specifically, the authors propose a new reference data-driven method called EnProCell to improve cell type classification performance from single-cell RNA sequencing data.

### Background and Problem

Single-cell RNA sequencing technology (scRNA-seq) enables researchers to study cell diversity at the single-cell level, providing a global view of cell type specificity in biological mechanisms such as developmental processes and human organogenesis. However, the data generated from scRNA-seq experiments are typically high-dimensional and sparse, posing challenges for cell type classification. Existing methods mostly rely on unsupervised low-dimensional projections, which, while capturing high variance in the data, do not necessarily ensure class separability, thereby affecting classification performance.

### Solution

To overcome this issue, the authors propose the EnProCell method. The main steps of this method include:

1. **Low-Dimensional Projection**: Combining Principal Component Analysis (PCA) and Multiple Discriminant Analysis (MDA) to obtain low-dimensional projections. PCA is used to capture components with high variance, while MDA ensures class separability.
2. **Deep Neural Network Training**: Training a deep neural network on the low-dimensional representation of the data to classify cell types.

### Experimental Results

The authors tested the EnProCell method on multiple datasets and compared it with existing state-of-the-art methods. The results show that EnProCell outperforms other methods on several metrics, particularly in terms of classification performance on both reference and query datasets. Specifically:

- **Reference Dataset Classification**: EnProCell achieved higher accuracy and F1 scores than other methods on datasets such as PBMC1, Baron, and Muraro.
- **Query Dataset Classification**: EnProCell also excelled in predicting unknown cell types in query datasets, significantly improving accuracy and F1 scores.

### Main Contributions

1. **Improved Low-Dimensional Projection**: By combining PCA and MDA, EnProCell better captures high variance and class separability in the data.
2. **Efficient Classification Performance**: EnProCell demonstrates excellent classification performance across multiple datasets without requiring excessive computational resources and time.
3. **Broad Applicability**: The method is not only suitable for reference datasets but also effectively handles query datasets, showing high generalization capability.

Overall, the paper addresses the accuracy issue in cell type classification within single-cell RNA sequencing data by proposing the EnProCell method, providing new tools and insights for research in the related field.
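
### Illustrative Sketch

The two-phase idea described in the summary (a PCA plus MDA projection, followed by a neural network classifier) can be sketched roughly as below. This is not the authors' implementation. It assumes scikit-learn, uses `LinearDiscriminantAnalysis` as a stand-in for multiple discriminant analysis, combines the two projections by simple concatenation, substitutes a small `MLPClassifier` for the paper's deep neural network, and runs on synthetic data rather than scRNA-seq counts. The function names, component counts, and layer sizes are all hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

N_CLASSES = 4   # hypothetical number of cell types
N_PCA = 20      # hypothetical number of principal components


def fit_projection(X_ref, y_ref, n_pca=N_PCA):
    """Fit the scaler, PCA (high variance) and LDA (class separability) on reference data."""
    scaler = StandardScaler().fit(X_ref)
    X_scaled = scaler.transform(X_ref)
    pca = PCA(n_components=n_pca, random_state=0).fit(X_scaled)
    lda = LinearDiscriminantAnalysis().fit(X_scaled, y_ref)
    return scaler, pca, lda


def project(X, scaler, pca, lda):
    """Concatenate PCA and LDA coordinates (assumed way of combining the two projections)."""
    X_scaled = scaler.transform(X)
    return np.hstack([pca.transform(X_scaled), lda.transform(X_scaled)])


# Synthetic stand-in for an expression matrix: rows are cells, columns are genes.
X, y = make_classification(
    n_samples=600, n_features=200, n_informative=30, n_redundant=0,
    n_classes=N_CLASSES, n_clusters_per_class=1, random_state=0,
)
X_ref, X_query, y_ref, y_query = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Phase 1: learn the low-dimensional projection from the reference data only.
scaler, pca, lda = fit_projection(X_ref, y_ref)
Z_ref = project(X_ref, scaler, pca, lda)
Z_query = project(X_query, scaler, pca, lda)

# Phase 2: train a small neural network on the projected reference data.
clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
clf.fit(Z_ref, y_ref)

# Evaluate on the held-out "query" cells with the metrics named in the abstract.
y_pred = clf.predict(Z_query)
accuracy = accuracy_score(y_query, y_pred)
macro_f1 = f1_score(y_query, y_pred, average="macro")
print(f"accuracy={accuracy:.3f} macro_f1={macro_f1:.3f}")
```

The projection is fitted on the reference split only and then applied unchanged to the query split, which mirrors the reference-to-query setting described above. With `N_CLASSES` classes, LDA contributes at most `N_CLASSES - 1` discriminant axes, so the combined representation here has `N_PCA + N_CLASSES - 1` columns.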