The impact of package selection and versioning on single-cell RNA-seq analysis

Joseph M Rich, Lambda Moses, Pétur Helgi Einarsson, Kayla Jackson, Laura Luebbert, A. Sina Booeshaghi, Sindri Antonsson, Delaney K. Sullivan, Nicolas Bray, Páll Melsted, Lior Pachter
DOI: https://doi.org/10.1101/2024.04.04.588111
2024-04-11
Abstract: Standard single-cell RNA-sequencing analysis (scRNA-seq) workflows consist of converting raw read data into cell-gene count matrices through sequence alignment, followed by analyses including filtering, highly variable gene selection, dimensionality reduction, clustering, and differential expression analysis. Seurat and Scanpy are the most widely-used packages implementing such workflows, and are generally thought to implement individual steps similarly. We investigate in detail the algorithms and methods underlying Seurat and Scanpy and find that there are, in fact, considerable differences in the outputs of Seurat and Scanpy. The extent of differences between the programs is approximately equivalent to the variability that would be introduced in benchmarking scRNA-seq datasets by sequencing less than 5% of the reads or analyzing less than 20% of the cell population. Additionally, distinct versions of Seurat and Scanpy can produce very different results, especially during parts of differential expression analysis. Our analysis highlights the need for users of scRNA-seq to carefully assess the tools on which they rely, and the importance of developers of scientific software to prioritize transparency, consistency, and reproducibility for their tools.
Bioinformatics
What problem does this paper attempt to address?
This paper addresses how the results of single-cell RNA sequencing (scRNA-seq) analysis differ depending on which software package, and which version of it, is used. Specifically, it examines four factors:

1. **Impact of software package selection**: The paper compares Seurat and Scanpy, the two most commonly used scRNA-seq analysis tools, and studies how their outputs differ when they analyze the same dataset under default settings. The differences span multiple steps: cell and gene filtering, highly variable gene selection, principal component analysis (PCA), shared nearest neighbor (SNN) graph construction, clustering, t-SNE and UMAP dimensionality reduction, and differential expression (DE) analysis.
2. **Impact of version updates**: Beyond comparing different packages, the paper also compares different versions of the same package: Seurat v5 and v4, Scanpy v1.9 and v1.4, and Cell Ranger v7 and v6. These differences appear mainly in the selection of significant marker genes, logFC estimation, and adjusted p-value calculation.
3. **Impact of data volume reduction**: The paper evaluates how reducing the amount of data affects the analysis results by simulating read and cell down-sampling. It finds that even when only a small fraction of the original data is kept, most analysis steps still retain most of the information, particularly under read down-sampling.
4. **Impact of random seeds**: The paper also examines how random seeds affect certain analysis steps (such as approximate KNN search, Louvain/Leiden clustering, and UMAP dimensionality reduction) to evaluate the variability of these steps across seeds.
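As a rough illustration of how the output of a step such as highly variable gene selection or marker gene selection could be compared between two tools or versions, the sketch below computes the Jaccard index of two gene sets. The choice of the Jaccard index, the function name `jaccard`, and the gene lists are assumptions made here for illustration. They are not taken from the paper, and the gene names are made-up example values rather than results.

```python
def jaccard(a, b):
    """Jaccard index |A & B| / |A | B| of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    # Two empty sets are treated as identical here (a convention, not a rule).
    return len(a & b) / len(union) if union else 1.0


# Hypothetical gene lists standing in for the output of two tools.
genes_tool_a = ["CD3D", "MS4A1", "NKG7", "LYZ"]
genes_tool_b = ["CD3D", "MS4A1", "GNLY", "LYZ", "PPBP"]

# 3 shared genes out of 6 distinct genes in total.
overlap = jaccard(genes_tool_a, genes_tool_b)
print(f"Jaccard index: {overlap:.2f}")
```

A value of 1 means the two gene sets are identical and 0 means they share nothing, so the same function can be reused to compare versions of a single package.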
In summary, the paper aims to quantify and understand how software package selection, version updates, data volume reduction, and random seeds affect the results of scRNA-seq analysis. It stresses that users should be cautious when selecting and using these tools, and that developers should improve the transparency, consistency, and reproducibility of their tools.
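The data volume reduction and random seed factors can be sketched on a toy count matrix. The example below is a minimal sketch under stated assumptions, not the paper's procedure: it approximates read down-sampling by keeping each counted read independently with a fixed probability, applied to an already-built count matrix, and approximates cell down-sampling by keeping a random subset of rows. The function names, the toy matrix, and the fractions are all hypothetical.

```python
import random


def downsample_reads(counts, fraction, seed=0):
    """Keep each counted read independently with probability `fraction`.

    Assumption: thinning the count matrix stands in for down-sampling
    raw reads before alignment.
    """
    rng = random.Random(seed)
    return [
        [sum(rng.random() < fraction for _ in range(n)) for n in row]
        for row in counts
    ]


def downsample_cells(counts, fraction, seed=0):
    """Keep a random subset of cells (rows), preserving their original order."""
    rng = random.Random(seed)
    k = max(1, round(len(counts) * fraction))
    keep = sorted(rng.sample(range(len(counts)), k))
    return [counts[i] for i in keep]


# Toy cell-by-gene count matrix (4 cells, 3 genes); values are made up.
counts = [
    [10, 0, 3],
    [2, 7, 0],
    [0, 1, 12],
    [5, 5, 5],
]

reads_a = downsample_reads(counts, fraction=0.5, seed=1)
reads_b = downsample_reads(counts, fraction=0.5, seed=1)  # same seed as reads_a
reads_c = downsample_reads(counts, fraction=0.5, seed=2)  # different seed
cells_half = downsample_cells(counts, fraction=0.5, seed=1)

print(reads_a == reads_b)  # identical seeds reproduce the same draw
print(len(cells_half))     # number of cells kept
```

Fixing the seed makes a stochastic step repeatable, while rerunning with several seeds (as with `reads_c`) gives a sense of how much of the variation in a result comes from randomness alone.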