scSparkXMBD - High-Performance scRNA-seq Data Processing with Spark.

Yu Liu, Mingxuan Gao, Lixuan Tan, Hongjin Liu, Yating Lin, Wenxian Yang, Rongshan Yu
DOI: https://doi.org/10.1109/BIBM52615.2021.9669512
2021-01-01
Abstract: High-throughput single-cell RNA sequencing (scRNA-seq) data processing pipelines integrate multiple modules to transform raw scRNA-seq data into gene expression matrices, including barcode processing, sequence quality control, genome alignment, and transcript quantification. With the rapid growth in data volume, the speed of scRNA-seq data processing pipelines has become a major bottleneck for large-scale scRNA-seq studies. We present scSparkXMBD (denoted as scSpark), a cloud-computing-based scRNA-seq data processing pipeline. By leveraging the in-memory computing capability of Apache Spark, scSpark significantly improves the processing speed of scRNA-seq data, running around 5 to 20 times faster than state-of-the-art processing pipelines with the same number of CPU cores. In addition, thanks to the inherent scalability of Spark in a cloud computing environment, scSpark can further reduce the processing time for a typical scRNA-seq dataset (e.g., 640 million reads) from hours to minutes when multiple computer nodes (e.g., 16) are used. Biological evaluation also confirmed that the results generated by scSpark are highly consistent with those of existing scRNA-seq data processing pipelines.
Note: XMBD refers to Xiamen Big Data, a biomedical open software initiative in the National Institute for Data Science in Health and Medicine, Xiamen University, China.
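The abstract does not describe scSpark's internals, so the following is only an illustrative sketch of why barcode processing and transcript quantification suit a map/reduce engine such as Spark: each read can be keyed independently, and counts are merged per key. It uses plain Python rather than Spark, and the read layout (a 16-base cell barcode followed by a 10-base UMI), the pre-assigned gene labels, and the names `map_read`, `merge_umis`, and `count_matrix` are all assumptions made for illustration, not details from the paper.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical read layout (an assumption, not taken from the paper):
# each barcode read holds a 16-base cell barcode followed by a 10-base UMI.
BARCODE_LEN = 16
UMI_LEN = 10


def map_read(record):
    """Map step: turn one (barcode_read, gene) pair into a keyed record."""
    barcode_read, gene = record
    barcode = barcode_read[:BARCODE_LEN]
    umi = barcode_read[BARCODE_LEN:BARCODE_LEN + UMI_LEN]
    return (barcode, gene), {umi}


def merge_umis(acc, keyed):
    """Reduce step: union the UMI sets that share a (barcode, gene) key."""
    key, umis = keyed
    acc[key] |= umis
    return acc


def count_matrix(records):
    """Return {(barcode, gene): number of distinct UMIs} as a sparse matrix."""
    merged = reduce(merge_umis, map(map_read, records), defaultdict(set))
    return {key: len(umis) for key, umis in merged.items()}


if __name__ == "__main__":
    # Toy input: the gene label stands in for the alignment result.
    toy_records = [
        ("A" * 16 + "C" * 10, "GENE1"),
        ("A" * 16 + "C" * 10, "GENE1"),  # duplicate UMI, counted once
        ("A" * 16 + "G" * 10, "GENE1"),
        ("T" * 16 + "C" * 10, "GENE2"),
    ]
    for (barcode, gene), n in sorted(count_matrix(toy_records).items()):
        print(barcode, gene, n)
```

Because the map step touches one read at a time and the reduce step only merges sets under the same key, the same logic could in principle be partitioned across many workers, which is the property a distributed in-memory engine exploits.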