A human ensemble cell atlas (hECA) enables in data cell sorting

Sijie Chen, Yanting Luo, Haoxiang Gao, Fanhong Li, Jiaqi Li, Yixin Chen, Renke You, Minsheng Hao, Haiyang Bian, Xi Xi, Wenrui Li, Weiyu Li, Mingli Ye, Qiuchen Meng, Ziheng Zou, Chen Li, Haochen Li, Yangyuan Zhang, Yanfei Cui, Lei Wei, Fufeng Chen, Xiaowo Wang, Hairong Lv, Kui Hua, Rui Jiang, Xuegong Zhang
DOI: https://doi.org/10.1101/2021.07.21.453289
2021-01-01
Abstract: The significance of building atlases of human cells as references for future biological and medical studies of humans in health and disease has been well recognized. Compared to the rapid accumulation of single-cell data, less work has been published on the information structure for assembling cell atlases, or on methods for using reference atlases once they are ready. Most existing cell atlas work organizes single-cell gene expression data as a collection of individual files, allowing users to download selected data sheets or to annotate query cells using models pretrained on the collected data. These features are useful as the basic uses of cell atlases. More comprehensive uses of global cell atlases can be developed once data of cells from multiple organs across different studies are assembled into one orchestrated data repository rather than a collection of data files. For this purpose, we present a unified giant table (uGT) to store and organize single-cell data from multiple studies in a single huge data repository, and a unified hierarchical annotation framework (uHAF) to annotate cells from uncoordinated studies. Based on these technologies, we developed a system that enables users to design complex rules to recruit from the atlas cells that meet certain conditions, such as a desired expression range of one or multiple genes and required organ or tissue origins or developmental stages, across multiple datasets that were otherwise unconnected. The conditions can be expressed as sophisticated logic criteria to pinpoint specific cells that cannot be easily spotted in traditional in vivo or in vitro cell sorting or in traditional searching of published data. We name this technology in data cell sorting from cell atlases.
With the increasing coverage of the cell atlas, this in data experiment paradigm will enable scientists to conduct investigations in the data space beyond the restrictions of traditional in vivo and in vitro experiments. In the current work, we collected scRNA-seq data of more than 1 million human cells from scattered studies and assembled them into a human Ensemble Cell Atlas (hECA) using the proposed information structure, and provided comprehensive tools for in data experiments based on the atlas. Case examples on agile construction of atlases of particular cell types and on off-target prediction for targeted therapy showed that in data cell sorting is an efficient and effective way to make comprehensive discoveries. hECA provides a powerful platform for assembling massive scattered single-cell data into a unified atlas, and can serve as a prototype for building future cell atlases.
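The idea of in data cell sorting, recruiting cells from one unified table by composing logic criteria over gene expression and metadata, can be illustrated with a minimal Python sketch. This is not hECA's actual interface or storage format: the `Cell` record, the helper names (`expr_between`, `organ_in`, `all_of`, `any_of`, `negate`, `sort_cells`), and every value in the toy table are hypothetical and made up purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Cell:
    """One row of a toy unified table (hypothetical schema, not the real uGT)."""
    cell_id: str
    study: str
    organ: str
    cell_type: str
    expression: Dict[str, float] = field(default_factory=dict)

# Invented toy data: cells from three unrelated studies stored in one table.
TOY_TABLE: List[Cell] = [
    Cell("c1", "study_A", "Heart", "T cell", {"CD3D": 5.2, "PTPRC": 3.1}),
    Cell("c2", "study_A", "Heart", "Cardiomyocyte", {"CD3D": 0.0, "PTPRC": 0.0}),
    Cell("c3", "study_B", "Lung", "T cell", {"CD3D": 4.0, "PTPRC": 2.5}),
    Cell("c4", "study_B", "Lung", "Fibroblast", {"CD3D": 0.1, "PTPRC": 0.0}),
    Cell("c5", "study_C", "Liver", "T cell", {"CD3D": 0.4, "PTPRC": 1.8}),
]

# A rule is any predicate over a cell; rules compose with AND / OR / NOT.
Rule = Callable[[Cell], bool]

def expr_between(gene: str, low: float, high: float = float("inf")) -> Rule:
    """True when the gene's expression lies in [low, high]; missing genes count as 0."""
    return lambda cell: low <= cell.expression.get(gene, 0.0) <= high

def organ_in(*organs: str) -> Rule:
    """True when the cell originates from one of the listed organs."""
    allowed = set(organs)
    return lambda cell: cell.organ in allowed

def all_of(*rules: Rule) -> Rule:
    return lambda cell: all(rule(cell) for rule in rules)

def any_of(*rules: Rule) -> Rule:
    return lambda cell: any(rule(cell) for rule in rules)

def negate(rule: Rule) -> Rule:
    return lambda cell: not rule(cell)

def sort_cells(table: List[Cell], rule: Rule) -> List[Cell]:
    """Recruit every cell in the table that satisfies the rule, keeping table order."""
    return [cell for cell in table if rule(cell)]

# Example criterion: CD3D >= 1 AND PTPRC >= 1 AND NOT from the liver.
rule = all_of(
    expr_between("CD3D", 1.0),
    expr_between("PTPRC", 1.0),
    negate(organ_in("Liver")),
)
selected = sort_cells(TOY_TABLE, rule)
selected_ids = [cell.cell_id for cell in selected]
selected_studies = sorted({cell.study for cell in selected})
```

Because all cells live in one table with shared field names, a single criterion recruits matching cells across studies that were never coordinated with each other, which is the property the abstract attributes to the unified repository.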