scHyena: Foundation Model for Full-Length Single-Cell RNA-Seq Analysis in Brain

Gyutaek Oh, Baekgyu Choi, Inkyung Jung, Jong Chul Ye
2023-10-04
Abstract: Single-cell RNA sequencing (scRNA-seq) has made significant strides in unraveling the intricate cellular diversity within complex tissues. This is particularly critical in the brain, which presents a greater diversity of cell types than other tissues, for gaining a deeper understanding of brain function within various cellular contexts. However, analyzing scRNA-seq data remains a challenge due to inherent measurement noise stemming from dropout events and the limited utilization of extensive gene expression information. In this work, we introduce scHyena, a foundation model designed to address these challenges and enhance the accuracy of scRNA-seq analysis in the brain. Specifically, inspired by the recent Hyena operator, we design a novel Transformer architecture called single-cell Hyena (scHyena) that is equipped with a linear adaptor layer, positional encoding via gene embedding, and a bidirectional Hyena operator. This enables us to process full-length scRNA-seq data without losing any information from the raw data. In particular, our model learns generalizable features of cells and genes through pre-training scHyena on full-length scRNA-seq data. We demonstrate the superior performance of scHyena compared to other benchmark methods in downstream tasks, including cell type classification and scRNA-seq imputation.
Machine Learning, Artificial Intelligence, Genomics, Quantitative Methods
What problem does this paper attempt to address?
The paper aims to address several key challenges in single-cell RNA sequencing (scRNA-seq) data analysis. Specifically:

1. **Dropout phenomenon**: Because the amount of mRNA in an individual cell is very limited, gene expression information is easily lost during sequencing, resulting in a large number of zero values in scRNA-seq data. Distinguishing true zero values from zero values caused by dropout events therefore becomes an important task.
2. **Long sequence processing problem**: scRNA-seq data typically contains expression level information for tens of thousands of genes. Traditional analysis methods often require selecting highly variable genes (HVGs), which not only introduces sensitivity to parameter selection but may also lead to information loss. A method capable of handling all gene information is therefore needed.

To address these issues, the authors propose the scHyena model. This model is based on the Transformer architecture and introduces the Hyena operator to process full-length scRNA-seq data without the need for dimensionality reduction or HVG selection. Through a pre-training process, scHyena can learn general features of cells and genes and demonstrates superior performance in downstream tasks such as cell type classification and scRNA-seq imputation. Experimental results show that scHyena outperforms existing benchmark methods across multiple datasets.
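
The two challenges described above, dropout zeros and the information discarded by HVG selection, can be made concrete with a small toy sketch. This is not the paper's method or data: the Poisson count matrix, the independent per-entry dropout model, the plain variance ranking used as a stand-in for HVG selection, and the names `simulate_dropout` and `top_variable_genes` are all assumptions chosen for illustration.

```python
import numpy as np


def simulate_dropout(counts, dropout_rate, rng):
    """Toy dropout model (assumption): zero each entry independently
    with probability `dropout_rate`. Returns the observed matrix and
    the boolean mask of entries that were dropped."""
    mask = rng.random(counts.shape) < dropout_rate
    observed = counts.copy()
    observed[mask] = 0
    return observed, mask


def top_variable_genes(matrix, n_top):
    """Simplified stand-in for HVG selection (assumption): rank genes
    (columns) by raw variance across cells and keep the top `n_top`."""
    variances = matrix.var(axis=0)
    return np.argsort(variances)[::-1][:n_top]


rng = np.random.default_rng(0)
n_cells, n_genes = 200, 1000

# Hypothetical "true" expression counts: cells in rows, genes in columns.
true_counts = rng.poisson(lam=2.0, size=(n_cells, n_genes))
observed, dropped = simulate_dropout(true_counts, dropout_rate=0.3, rng=rng)

# After dropout, true zeros and dropout zeros are indistinguishable
# in the observed matrix; only the zero fraction visibly changes.
true_zero_frac = (true_counts == 0).mean()
observed_zero_frac = (observed == 0).mean()

# HVG selection keeps only a subset of genes; the rest are discarded.
hvg_idx = top_variable_genes(observed, n_top=100)
kept_fraction = len(hvg_idx) / n_genes

print(f"zero fraction before dropout: {true_zero_frac:.3f}")
print(f"zero fraction after dropout:  {observed_zero_frac:.3f}")
print(f"fraction of genes kept by HVG selection: {kept_fraction:.2f}")
```

In this sketch, keeping 100 of 1000 genes discards 90% of the gene dimension before any model sees the data, which is the kind of information loss a full-length model is meant to avoid. Practical HVG pipelines typically use normalized dispersion rather than raw variance, so the ranking here should be read only as a simplified illustration.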