Deep-learning prediction of gene expression from personal genomes

Shiron Drusinsky, Sean Whalen, Katherine S. Pollard
DOI: https://doi.org/10.1101/2024.07.27.605449
2024-07-27
Abstract: Models that predict RNA levels from DNA sequences show tremendous promise for decoding tissue-specific gene regulatory mechanisms, revealing the genetic architecture of traits, and interpreting noncoding genetic variation. Existing methods take two different approaches: 1) associating expression with linear combinations of common genetic variants (training across individuals on single genes), or 2) learning genome-wide sequence-to-expression rules with neural networks (training across loci using a reference genome). Since limitations of both strategies have been highlighted recently, we sought to combine the sequence context provided by deep learning with the information provided by cross-individual training. We utilized fine-tuning to develop Performer, a model with accuracy approaching the cis-heritability of most genes. Performer prioritizes genetic variants across the allele frequency spectrum that disrupt motifs, fall in annotated regulatory elements, and have functional evidence for modulating gene expression. While obstacles remain in personalized expression prediction, our findings establish deep learning as a viable strategy.
Genetics
What problem does this paper attempt to address?
This paper aims to predict gene expression from personal genomes with deep learning while overcoming limitations of existing methods. Specifically, it addresses the following issues:

1. **Combining the advantages of two methods**: Existing methods predict gene expression in one of two ways. The first associates expression with a linear combination of common genetic variants, training across individuals on a single gene. The second uses neural networks to learn genome-wide sequence-to-expression rules, training across loci on a reference genome. The paper aims to combine the strengths of both: the sequence context provided by deep learning and the information provided by cross-individual training.
2. **Addressing the limitations of existing deep learning models**: Current deep learning models do not reliably explain variation in gene expression between individuals, and they struggle in particular to predict the direction of effect of expression quantitative trait loci (eQTLs).
3. **Developing a new deep learning model, Performer**: To overcome these limitations, the researchers developed Performer, which uses a fine-tuning strategy to train across individuals. Its accuracy approaches the cis-heritability of most genes, and it prioritizes genetic variants that affect gene expression.
4. **Evaluating the performance of the new model**: Using a large number of samples from the GTEx dataset, the paper reports that Performer outperforms existing deep learning models and linear models at predicting individual-level gene expression. Performer explains more of the heritability of expression and also correctly predicts the direction of effect of genetic variants on gene expression.

In summary, the paper develops and evaluates Performer, a new deep learning model intended to address the limitations of existing models in predicting gene expression from personal genomes.
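
To make the first of the two existing approaches in the abstract concrete (a linear combination of common variants, trained across individuals on a single gene), here is a minimal sketch. It is not the paper's method and not Performer: the ridge penalty, the function names `fit_ridge` and `predict`, and all data are assumptions made for illustration. The genotypes and expression values are simulated, not taken from GTEx.

```python
import numpy as np

def fit_ridge(genotypes, expression, alpha=1.0):
    """Closed-form ridge regression for one gene across many individuals.

    genotypes: (n_individuals, n_variants) allele dosages in {0, 1, 2}
    expression: (n_individuals,) expression level of a single gene
    alpha: ridge penalty (hypothetical default, not from the paper)
    """
    column_means = genotypes.mean(axis=0)
    X = genotypes - column_means          # center dosages
    y = expression - expression.mean()    # center expression
    n_variants = X.shape[1]
    weights = np.linalg.solve(X.T @ X + alpha * np.eye(n_variants), X.T @ y)
    intercept = expression.mean() - column_means @ weights
    return weights, intercept

def predict(genotypes, weights, intercept):
    """Predicted expression is a linear combination of variant dosages."""
    return genotypes @ weights + intercept

# Simulated data (assumption): 400 individuals, 20 common variants,
# two of which truly affect expression, in opposite directions.
rng = np.random.default_rng(0)
n_individuals, n_variants = 400, 20
genotypes = rng.binomial(2, 0.3, size=(n_individuals, n_variants)).astype(float)
true_effects = np.zeros(n_variants)
true_effects[[2, 7]] = [0.8, -0.5]
expression = genotypes @ true_effects + rng.normal(0.0, 0.5, n_individuals)

# Train on some individuals, evaluate on held-out individuals.
train, test = slice(0, 300), slice(300, 400)
weights, intercept = fit_ridge(genotypes[train], expression[train])
predicted = predict(genotypes[test], weights, intercept)

# Cross-individual accuracy for this gene, plus the estimated
# direction of effect for the two causal variants.
r = np.corrcoef(predicted, expression[test])[0, 1]
print(f"held-out Pearson r: {r:.2f}")
print("estimated effect signs:", np.sign(weights[[2, 7]]))
```

Because this kind of model sees only variant dosages for one gene, it carries no sequence context and cannot score variants absent from its training individuals. That is the gap the summary above describes deep learning as filling.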