Evaluating the role of pre-training dataset size and diversity on single-cell foundation model performance

Alan DenAdel, Madeline Hughes, Akshaya Thoutam, Anay Gupta, Andrew W. Navia, Nicolo Fusi, Srivatsan Raghavan, Peter S. Winter, Ava Pardis Amini, Lorin Crawford
DOI: https://doi.org/10.1101/2024.12.13.628448
2024-12-17
Abstract: The success of transformer-based foundation models on natural language and images has motivated their use in single-cell biology. Single-cell foundation models have been trained on increasingly large transcriptomic datasets, scaling from initial studies with 1 million cells to newer atlases with over 100 million cells. This study investigates the role of pre-training dataset size and diversity on the performance of single-cell foundation models on both zero-shot and fine-tuned tasks. Using a large corpus of 22.2 million cells, we pre-train a total of 375 models, which we evaluate by conducting 3,750 experiments. Our results show that current methods tend to plateau in performance with pre-training datasets that are only a fraction of the size.
Biology