Abstract: The genetic control of gene expression is a core component of human physiology. For the past several years, transcriptome-wide association studies have leveraged large datasets of linked genotype and RNA sequencing information to create a powerful gene-based test of association that has been used in dozens of studies. While numerous discoveries have been made, the populations in the training data are overwhelmingly of European descent, and little is known about the generalizability of these models to other populations. Here, we test for cross-population generalizability of gene expression prediction models using a dataset of African American individuals with RNA-Seq data in whole blood. We find that the default models trained in large datasets such as GTEx and DGN fare poorly in African Americans, with a notable reduction in prediction accuracy when compared to European Americans. We replicate these limitations in cross-population generalizability using the five populations in the GEUVADIS dataset. Via realistic simulations of both populations and gene expression, we show that accurate cross-population generalizability of transcriptome prediction only arises when eQTL architecture is substantially shared across populations. In contrast, models with non-identical eQTLs showed patterns similar to real-world data. Therefore, generating RNA-Seq data in diverse populations is a critical step towards multi-ethnic utility of gene expression prediction.

Advances in RNA sequencing technology have reduced the cost of measuring gene expression at a genome-wide level. However, sequencing enough human RNA samples for adequately-powered disease association studies remains prohibitively costly. To this end, modern transcriptome-wide association analysis tools leverage existing paired genotype-expression datasets by creating models to predict gene expression using genotypes.
These predictive models enable researchers to perform cost-effective association tests with gene expression in independently genotyped samples. However, most of these models use European reference data, and the extent to which gene expression prediction models work across populations is not fully resolved. We observe that models derived from European-descent individuals predict gene expression worse than expected in a dataset of African Americans. Using simulations, we show that the performance of gene expression prediction models depends on both the proportion of genetic variants shared between population-specific prediction models and the genetic relatedness between populations. Our findings suggest a need to carefully select reference populations for prediction and point to a pressing need for more genetically diverse genotype-expression datasets.
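
The dependence on shared eQTL architecture described above can be illustrated with a toy simulation. The sketch below is not the simulation pipeline used in the study. It assumes independent SNPs (no linkage disequilibrium), Hardy-Weinberg genotypes, a fixed heritability, and a plain ridge regression as a stand-in prediction model. All function names and parameter values are hypothetical and chosen only for illustration. A model is trained in population A and then evaluated in held-out samples from population A and from a population B whose causal eQTLs overlap those of A by a chosen fraction.

```python
import numpy as np


def simulate_genotypes(rng, n, freqs):
    """Draw 0/1/2 allele dosages for n individuals (assumes Hardy-Weinberg, no LD)."""
    return rng.binomial(2, freqs, size=(n, freqs.size)).astype(float)


def simulate_expression(rng, G, beta, h2):
    """Genetic component G @ beta plus Gaussian noise scaled to heritability h2."""
    g = G @ beta
    noise_sd = np.sqrt(g.var() * (1.0 - h2) / h2)
    return g + rng.normal(0.0, noise_sd, size=g.size)


def fit_ridge(G, y, lam=1.0):
    """Closed-form ridge regression on mean-centered genotypes."""
    mu = G.mean(axis=0)
    Gc = G - mu
    w = np.linalg.solve(Gc.T @ Gc + lam * np.eye(G.shape[1]), Gc.T @ (y - y.mean()))
    return w, mu, y.mean()


def squared_corr(y, y_hat):
    """Squared Pearson correlation between observed and predicted expression."""
    return float(np.corrcoef(y, y_hat)[0, 1] ** 2)


def cross_population_r2(shared, seed=0, m=50, k=5, n=1000, h2=0.5):
    """Train in population A; return (R2 in held-out A, R2 in population B).

    `shared` is the fraction of the k causal eQTLs that B has in common with A.
    """
    rng = np.random.default_rng(seed)

    # Population B allele frequencies are a perturbed copy of population A's.
    freq_a = rng.uniform(0.05, 0.5, size=m)
    freq_b = np.clip(freq_a + rng.normal(0.0, 0.1, size=m), 0.05, 0.95)

    # Pick causal eQTLs; B reuses the first n_shared of A's and draws the rest anew.
    order = rng.permutation(m)
    causal_a, non_causal = order[:k], order[k:]
    n_shared = int(round(shared * k))
    causal_b = np.concatenate([causal_a[:n_shared], non_causal[:k - n_shared]])

    beta_a = np.zeros(m)
    beta_a[causal_a] = rng.normal(size=k)
    beta_b = np.zeros(m)
    beta_b[causal_b[:n_shared]] = beta_a[causal_a[:n_shared]]     # shared effects
    beta_b[causal_b[n_shared:]] = rng.normal(size=k - n_shared)  # B-specific effects

    # Fit the prediction model in population A only.
    G_train = simulate_genotypes(rng, n, freq_a)
    y_train = simulate_expression(rng, G_train, beta_a, h2)
    w, mu, intercept = fit_ridge(G_train, y_train)

    def predict(G):
        return (G - mu) @ w + intercept

    # Evaluate in independent samples from each population.
    G_a = simulate_genotypes(rng, n, freq_a)
    y_a = simulate_expression(rng, G_a, beta_a, h2)
    G_b = simulate_genotypes(rng, n, freq_b)
    y_b = simulate_expression(rng, G_b, beta_b, h2)
    return squared_corr(y_a, predict(G_a)), squared_corr(y_b, predict(G_b))


if __name__ == "__main__":
    for shared in (1.0, 0.6, 0.0):
        within, across = cross_population_r2(shared)
        print(f"shared={shared:.1f}  R2 within A={within:.3f}  R2 in B={across:.3f}")
```

Under these simplifying assumptions, prediction accuracy in population B should approach the within-population accuracy only when the causal eQTLs are fully shared, and should fall toward zero as the overlap shrinks. Because the SNPs are simulated independently, the sketch leaves out linkage disequilibrium differences between populations, which would matter in real genotype data.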

On the cross-population generalizability of gene expression prediction models

Cross-population enhancement of PrediXcan predictions with a gnomAD-based east Asian reference framework

Sources of gene expression variation in a globally diverse human cohort

Powerful mapping of cis-genetic effects on gene expression across diverse populations reveals novel disease-critical genes

Allele frequency impacts the cross-ancestry portability of gene expression prediction in lymphoblastoid cell lines

Deep-learning prediction of gene expression from personal genomes

Leveraging trans-ethnic genetic risk scores to improve association power for complex traits in underrepresented populations

Public RNA-seq data are not representative of global human diversity

Benchmarking of deep neural networks for predicting personal gene expression from DNA sequence highlights shortcomings

Comparing statistical learning methods for complex trait prediction from gene expression

Genetic analyses of diverse populations improves discovery for complex traits

Population-Matched Transcriptome Prediction Increases TWAS Discovery and Replication Rate

Standard machine learning approaches outperform deep representation learning on phenotype prediction from transcriptomics data

Genotype prediction of 336,463 samples from public expression data

Cross-Population Joint Analysis of eQTLs: Fine Mapping and Functional Annotation

Predicting the genetic component of gene expression using gene regulatory networks

Large-scale cis- and trans-eQTL analyses identify thousands of genetic loci and polygenic scores that regulate blood gene expression

Transcriptome-Wide Association Study of Blood Cell Traits in African Ancestry and Hispanic/Latino Populations

Genetic variants associated with cell-type-specific intra-individual gene expression variability reveal new mechanisms of genome regulation

Aberrant Gene Expression in Humans

Machine Learning Strategies for Improved Phenotype Prediction in Underrepresented Populations