CellMemory: Hierarchical Interpretation of Out-of-Distribution Cells Using Bottlenecked Transformer

Qifei Wang, He Zhu, Yiwen Hu, Yanjie Chen, Yuwei Wang, Xuegong Zhang, James Zou, Manolis Kellis, Yue Li, Dianbo Liu, Lan Jiang
DOI: https://doi.org/10.1101/2024.12.17.628533
2024-12-20
Abstract: Identifying the genetic and molecular drivers of phenotypic heterogeneity among individuals is vital for understanding human health and for diagnosing, monitoring, and treating diseases. To this end, international consortia such as the Human Cell Atlas and the Tabula Sapiens are creating comprehensive cellular references. Because of the massive volume of data generated, machine learning methods, especially transformer architectures, have been widely employed in related studies. However, applying machine learning to cellular data presents several challenges. One is making the methods interpretable with respect to both the input cellular information and its context. Another, less explored challenge is the accurate representation of cells outside existing references, referred to as out-of-distribution (OOD) cells. Cells can be out-of-distribution because of differing physiological conditions, as when diseased cells, particularly tumor cells, are compared with healthy reference data, or because of significant technical variation, as when transfer learning is applied from a single-cell reference to spatial query data. Inspired by the global workspace theory in cognitive neuroscience, we introduce CellMemory, a bottlenecked Transformer with improved generalization capabilities designed for the hierarchical interpretation of OOD cells unseen during reference building. Even without pre-training, it exceeds the performance of large language models pre-trained on tens of millions of cells. In particular, when deciphering spatially resolved single-cell transcriptomics data, CellMemory demonstrates the ability to interpret data accurately at the granule level. Finally, we harness CellMemory's robust representational capabilities to elucidate malignant cells and their founder cells in different patients, providing reliable characterizations of the cellular changes caused by the disease.
Bioinformatics
What problem does this paper attempt to address?