Abstract: The genetic control of gene expression is a core component of human physiology. For the past several years, transcriptome-wide association studies have leveraged large datasets of linked genotype and RNA sequencing information to create a powerful gene-based test of association that has been used in dozens of studies. While numerous discoveries have been made, the populations in the training data are overwhelmingly of European descent, and little is known about the generalizability of these models to other populations. Here, we test for cross-population generalizability of gene expression prediction models using a dataset of African American individuals with RNA-Seq data in whole blood. We find that the default models trained in large datasets such as GTEx and DGN fare poorly in African Americans, with a notable reduction in prediction accuracy when compared to European Americans. We replicate these limitations in cross-population generalizability using the five populations in the GEUVADIS dataset. Via realistic simulations of both populations and gene expression, we show that accurate cross-population generalizability of transcriptome prediction only arises when eQTL architecture is substantially shared across populations. In contrast, models with non-identical eQTLs showed patterns similar to real-world data. Therefore, generating RNA-Seq data in diverse populations is a critical step towards multi-ethnic utility of gene expression prediction.

Advances in RNA sequencing technology have reduced the cost of measuring gene expression at a genome-wide level. However, sequencing enough human RNA samples for adequately powered disease association studies remains prohibitively costly. To this end, modern transcriptome-wide association analysis tools leverage existing paired genotype-expression datasets by creating models to predict gene expression using genotypes. These predictive models enable researchers to perform cost-effective association tests with gene expression in independently genotyped samples. However, most of these models use European reference data, and the extent to which gene expression prediction models work across populations is not fully resolved. We observe that models derived from European-descent individuals predict gene expression worse than expected in a dataset of African Americans. Using simulations, we show that the performance of gene expression prediction models depends on both the proportion of genetic variants shared between population-specific prediction models and the genetic relatedness between populations. Our findings suggest a need to carefully select reference populations for prediction and point to a pressing need for more genetically diverse genotype-expression datasets.
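The dependence of cross-population accuracy on shared eQTL architecture can be illustrated with a toy simulation. The sketch below is not the simulation framework used in the study: it assumes independent SNPs (no linkage disequilibrium), uses ridge regression as a stand-in for the penalized models that transcriptome prediction tools typically fit, and every name and parameter in it (`cross_population_r2`, `shared_fraction`, `h2`, the allele-frequency ranges) is hypothetical. A model is trained in population A and then scored in held-out samples from A and from a population B whose eQTLs overlap only partly with those of A.

```python
import numpy as np


def simulate_genotypes(rng, n, freqs):
    """Draw additive dosages (0, 1, 2) independently per SNP; LD is ignored."""
    return rng.binomial(2, freqs, size=(n, freqs.size)).astype(float)


def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression on centered data; returns (weights, intercept)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ (y - y_mean))
    return w, y_mean - x_mean @ w


def squared_corr(y, y_hat):
    """Squared Pearson correlation between measured and predicted expression."""
    return float(np.corrcoef(y, y_hat)[0, 1] ** 2)


def cross_population_r2(shared_fraction, seed=0, n=500, n_snps=50, n_eqtl=10, h2=0.5):
    """Train in population A; return (R2 within A, R2 in population B)."""
    rng = np.random.default_rng(seed)
    # Assumed allele frequencies: population B drifts away from population A.
    freqs_a = rng.uniform(0.1, 0.5, n_snps)
    freqs_b = np.clip(freqs_a + rng.normal(0.0, 0.1, n_snps), 0.05, 0.95)

    # Population A eQTLs are the first n_eqtl SNPs. Population B keeps a
    # fraction of them (same effect sizes) and gains private eQTLs elsewhere.
    n_shared = int(round(shared_fraction * n_eqtl))
    n_private = n_eqtl - n_shared
    beta_a = np.zeros(n_snps)
    beta_a[:n_eqtl] = rng.normal(0.0, 1.0, n_eqtl)
    beta_b = np.zeros(n_snps)
    beta_b[:n_shared] = beta_a[:n_shared]
    beta_b[n_eqtl:n_eqtl + n_private] = rng.normal(0.0, 1.0, n_private)

    def expression(G, beta):
        # Scale the noise so that the genetic component explains a share h2.
        genetic = G @ beta
        noise_sd = np.sqrt(genetic.var() * (1.0 - h2) / h2)
        return genetic + rng.normal(0.0, noise_sd, G.shape[0])

    G_train = simulate_genotypes(rng, n, freqs_a)
    G_test_a = simulate_genotypes(rng, n, freqs_a)
    G_test_b = simulate_genotypes(rng, n, freqs_b)
    w, b = ridge_fit(G_train, expression(G_train, beta_a))
    r2_within = squared_corr(expression(G_test_a, beta_a), G_test_a @ w + b)
    r2_cross = squared_corr(expression(G_test_b, beta_b), G_test_b @ w + b)
    return r2_within, r2_cross


if __name__ == "__main__":
    for fraction in (0.0, 0.5, 1.0):
        within, cross = cross_population_r2(fraction)
        print(f"shared={fraction:.1f}  R2 within A={within:.2f}  R2 in B={cross:.2f}")
```

Under these assumptions, accuracy in population B should approach the within-population value only as `shared_fraction` nears 1. The sketch varies eQTL sharing alone; it does not model the genetic relatedness between populations beyond a simple allele-frequency shift.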

Variable effects of steroid withdrawal on blood pressure reduction in cyclosporine-treated renal transplant recipients.

Deep-learning prediction of gene expression from personal genomes

Should we really use graph neural networks for transcriptomic prediction?

Enhancing Personalized Gene Expression Prediction From DNA Sequences Using Genomic Foundation Models

A deep auto-encoder model for gene expression prediction

DART-Eval: A Comprehensive DNA Language Model Evaluation Benchmark on Regulatory DNA

Benchmarking DNA Foundation Models for Genomic Sequence Classification

Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk

Does your model understand genes? A benchmark of gene properties for biological and text models

Effective gene expression prediction from sequence by integrating long-range interactions

Evaluation and optimization of sequence-based gene regulatory deep learning models

Scleral birefringence as measured by polarization-sensitive optical coherence tomography and ocular biometric parameters of human eyes in vivo.

A sandbox for prediction and integration of DNA, RNA, and proteins in single cells

On the cross-population generalizability of gene expression prediction models

Deep learning approaches for non-coding genetic variant effect prediction: current progress and future prospects

Deep Learning to Analyze RNA-Seq Gene Expression Data

A Benchmark Dataset for Multimodal Prediction of Enzymatic Function Coupling DNA Sequences and Natural Language

Advancing regulatory genomics with machine learning

Fine-tuning sequence-to-expression models on personal genome and transcriptome data

Benchmarking algorithms for gene regulatory network inference from single-cell transcriptomic data