scCompass: An integrated cross-species scRNA-seq database for AI-ready
Pengfei Wang, Wenhao Liu, Jiajia Wang, Yana Liu, Pengjiang Li, Ping Xu, Wentao Cui, Ran Zhang, Qingqing Long, Zhilong Hu, Chen Fang, Jingxi Dong, Chunyang Zhang, Yan Chen, Chengrui Wang, Guole Liu, Hanyu Xie, Yiyang Zhang, Meng Xiao, Shubai Chen, Yiqiang Chen, Ge Yang, Shihua Zhang, Zhen Meng, Xuezhi Wang, Guihai Feng, Xin Li, Yuanchun Zhou
DOI: https://doi.org/10.1101/2024.11.12.623138
2024-11-15
Abstract: Emerging single-cell sequencing technology has generated large amounts of data, allowing analysis of cellular dynamics and gene regulation at single-cell resolution. Advances in artificial intelligence enhance life sciences research by delivering critical insights and optimizing data analysis processes. However, inconsistent data processing quality and standards remain a major challenge. Here we propose scCompass, which provides a data quality solution for building a large-scale, cross-species and model-friendly single-cell data collection. By applying standardized data pre-processing, scCompass integrates and curates transcriptomic data from 13 species and nearly 105 million single cells. Using this extensive dataset, we are able to obtain stable expression genes (SEGs) and organ-specific expression genes (OSGs) in human and mouse. We provide datasets at different scales that can be easily adapted for AI model training, together with pretrained checkpoints for state-of-the-art (SOTA) single-cell foundation models. In summary, the AI-readiness of scCompass, combined with user-friendly data sharing, visualization and online analysis, greatly simplifies data access and exploitation for researchers in single-cell biology (http://www.bdbe.cn/kun).
Bioinformatics