Sfaira Accelerates Data and Model Reuse in Single Cell Genomics

David S. Fischer, Leander Dony, Martin König, Abdul Moeed, Luke Zappia, Sophie Tritschler, Olle Holmberg, Hananeh Aliee, Fabian J. Theis
DOI: https://doi.org/10.1101/2020.12.16.419036
IF: 17.906
2020-01-01
Genome Biology
Abstract: Exploratory analysis of single-cell RNA-seq data sets is currently based on statistical and machine learning models that are adapted to each new data set from scratch. A typical analysis workflow includes a choice of dimensionality reduction, selection of clustering parameters, and mapping of prior annotation. These steps typically require several iterations and can take up significant time in many single-cell RNA-seq projects. Here, we introduce sfaira, a single-cell data and model zoo that houses data sets as well as pre-trained models. The data zoo is designed to facilitate the fast and easy contribution of data sets, interfacing with a large community of data providers. Sfaira currently includes 233 data sets across 45 organs and 3.1 million cells in both human and mouse. Using these data sets, we have trained eight different example model classes, such as autoencoders and logistic cell type predictors. The infrastructure of sfaira is model agnostic and allows training and use of many previously published models. Sfaira directly aids exploratory data analysis by replacing embedding and cell type annotation workflows with end-to-end pre-trained parametric models. As further example use cases for sfaira, we demonstrate the extraction of gene-centric data statistics across many tissues, improved usage of cell type labels at different levels of coarseness, and an application for learning interpretable models through data regularization on extremely diverse data sets.