scaLR: a low-resource deep neural network-based platform for single cell analysis and biomarker discovery

Saiyam Jogani Jr., Anand Santosh Pol Sr., Mayur Prajapati II, Amit Samal, Kriti Bhatia Jr., Jayendra Parmar, Urvik Patel, Falak Shah, Nisarg Vyas, Saurabh Gupta Sr.
DOI: https://doi.org/10.1101/2024.09.19.613226
2024-09-20
Abstract: Purpose: Single-cell RNA sequencing (scRNA-seq) is producing vast amounts of individual cell profiling data. Analyzing such datasets presents a significant challenge: accurately annotating cell types and identifying their associated biomarkers. Analysis of scRNA-seq datasets will help us understand diseases such as Alzheimer's disease, cancer, diabetes, coronavirus disease 2019 (COVID-19), and systemic lupus erythematosus (SLE). Recently, different pipelines based on machine learning (ML) and deep neural network (DNN) methods have been employed to tackle these issues using scRNA-seq datasets. These pipelines have emerged as a promising resource and can extract meaningful and concise features from noisy, diverse, and high-dimensional data to enhance annotation and subsequent analysis. However, existing tools require substantial computational resources to process datasets with large numbers of samples.
Methods: We have developed a cutting-edge platform known as scaLR (Single Cell Analysis using Low Resource) that processes data in batches, reducing the resources required for processing large datasets and running neural network (NN) models. scaLR provides modules for data processing, feature extraction, training, evaluation, and downstream analysis. The data processing module consists of sample-wise and standard-scaler normalization and data splitting. Its novel feature extraction algorithm first trains the model on a feature subset and stores the feature importance for all the features in that subset. At the end of this process, the top K features are selected based on their importance. The model is then trained on the top K features; its performance evaluation and the associated downstream analysis provide significant biomarkers for different cell types and diseases/traits.
Results: To showcase the capabilities of scaLR, we utilized several scRNA-seq datasets of peripheral blood mononuclear cells (PBMCs), Alzheimer's patients, and large datasets from human and mouse embryonic development. Our findings indicate that scaLR offers comparable prediction accuracy and requires less model training time and fewer compute resources than existing Python-based pipelines and frameworks. Moreover, scaLR efficiently handles large sample datasets (>11.4 million cells) with minimal resource usage (29 GB RAM, 12 GB GPU, and 8 CPUs) while maintaining high prediction accuracy and ranking the association of biomarkers with specific cell types and diseases.
Conclusion: We present scaLR, a Python-based platform engineered to use minimal computational resources while maintaining execution times comparable to existing frameworks. It is highly scalable and capable of efficiently handling datasets containing millions of cell samples, providing both their classification and the important biomarkers.
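The feature extraction step described in the Methods can be illustrated with a minimal sketch. It is not scaLR's implementation or API. It assumes that the feature subsets are consecutive chunks that together cover all features, that the per-chunk model is a plain binary logistic regression, and that a feature's importance is the absolute value of its learned weight. All function names and the synthetic data are hypothetical.

```python
import math
import random


def train_logistic(X, y, epochs=200, lr=0.5):
    """Fit a binary logistic regression by batch gradient descent; return the weights."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, label in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, row))
            err = 1.0 / (1.0 + math.exp(-z)) - label
            for j in range(d):
                grad_w[j] += err * row[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w


def chunked_feature_scores(X, y, chunk_size):
    """Train one model per feature chunk and store |weight| as each feature's importance."""
    n_features = len(X[0])
    scores = [0.0] * n_features
    for start in range(0, n_features, chunk_size):
        cols = range(start, min(start + chunk_size, n_features))
        # Only this chunk's columns are materialised, so memory scales with chunk_size.
        X_chunk = [[row[j] for j in cols] for row in X]
        for j, wj in zip(cols, train_logistic(X_chunk, y)):
            scores[j] = abs(wj)
    return scores


def top_k(scores, k):
    """Return the indices of the k highest-scoring features."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


# Synthetic toy data (an assumption for illustration): 200 "cells", 12 "genes",
# of which only genes 3 and 8 differ between the two classes.
rng = random.Random(0)
n_cells, n_genes = 200, 12
informative = {3, 8}
y = [i % 2 for i in range(n_cells)]
X = [
    [rng.gauss(2.0 * label if j in informative else 0.0, 1.0) for j in range(n_genes)]
    for label in y
]

scores = chunked_feature_scores(X, y, chunk_size=4)
selected = top_k(scores, k=2)
# A final model would then be trained on the selected columns only.
X_top = [[row[j] for j in selected] for row in X]
```

Because each model sees only one chunk of columns at a time, peak memory depends on the chunk size rather than on the total number of features, which is the property the abstract attributes to the low-resource design.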
Bioinformatics