Abstract:

Background: Polygenic risk scores (PRS) have ushered in a new era in genetic epidemiology, offering insights into individual predispositions to a wide range of diseases. However, despite recent marked improvements in their predictive power, several challenges must be overcome before PRS-based models can be broadly applied in the clinic, including achieving sufficient accuracy, easy interpretability, and portability across diverse populations.

Methods: Leveraging trans-ancestry genome-wide association study (GWAS) meta-analysis, we generated novel, diverse summary statistics for 30 medically related traits and used them to benchmark the performance of six existing PRS algorithms in the UK Biobank. Observing that SBayesRC had the best overall performance, but recognizing strengths in each method, we developed an ensemble PRS model that uses logistic regression to combine outputs from the top-performing algorithms. This ensemble model was validated on the diverse eMERGE and PAGE MEC cohorts, and its performance was compared against current state-of-the-art PRS models. To enhance predictive accuracy for clinical application, we incorporated easily accessible clinical characteristics such as age, gender, ancestry, and risk factors, creating disease prediction models intended as prospective diagnostic tests with easily interpretable positive or negative outcomes.

Results: Predictive performance of PRS models improved with trans-ancestry GWAS meta-analysis and was further enhanced by the ensemble model, which surpassed state-of-the-art PRS models. When applied to external cohorts, performance drops were minimal, indicating good calibration. After adding clinical characteristics, 12 out of 30 models surpassed 80% AUC. Further, 25 traits exceeded a diagnostic odds ratio (DOR) of 5 and 19 traits exceeded a DOR of 10 in all ancestry groups, indicating high predictive value. The highest DOR in a population with a sufficient number of cases was 66.2, for Alzheimer's disease in Europeans. Our PRS model for coronary artery disease identified 55-80 times more true coronary events than rare pathogenic variant models, reinforcing its clinical potential. The polygenic component modulated the effect of high-risk rare variants, stressing the need to consider all genetic components in clinical settings.

Conclusions: Newly developed PRS-based disease prediction models have sufficient accuracy and portability to warrant consideration for clinical use.
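
For readers unfamiliar with the diagnostic odds ratio reported above, the following minimal sketch shows how a DOR can be computed once a model's output has been turned into positive or negative calls. It uses the standard definition DOR = (TP x TN) / (FP x FN). The abstract does not say how the authors chose their decision threshold or handled empty cells, so the `threshold` argument, the 0.5 continuity correction, and all function names here are illustrative assumptions, not the paper's method.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ConfusionCounts:
    tp: int  # predicted positive, truly a case
    fp: int  # predicted positive, truly a control
    fn: int  # predicted negative, truly a case
    tn: int  # predicted negative, truly a control


def count_outcomes(predicted: Sequence[bool], actual: Sequence[bool]) -> ConfusionCounts:
    """Tally the 2x2 confusion table from binary calls and true case status."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum((not p) and a for p, a in zip(predicted, actual))
    tn = sum((not p) and (not a) for p, a in zip(predicted, actual))
    return ConfusionCounts(tp, fp, fn, tn)


def diagnostic_odds_ratio(c: ConfusionCounts, continuity: float = 0.5) -> float:
    """DOR = (TP * TN) / (FP * FN).

    Assumption: if any cell is zero, add `continuity` to every cell
    so the ratio stays finite. The paper may have handled this differently.
    """
    cells = [float(c.tp), float(c.fp), float(c.fn), float(c.tn)]
    if min(cells) == 0:
        cells = [x + continuity for x in cells]
    tp, fp, fn, tn = cells
    return (tp * tn) / (fp * fn)


def classify(risk_scores: Sequence[float], threshold: float) -> list:
    """Turn continuous risk scores into positive/negative calls.

    Assumption: a single fixed cutoff; the choice of cutoff is hypothetical.
    """
    return [s >= threshold for s in risk_scores]
```

As a worked case, a test with sensitivity 0.8 and specificity 0.9 on 100 cases and 100 controls gives `ConfusionCounts(tp=80, fp=10, fn=20, tn=90)`, for a DOR of (80 x 90) / (10 x 20) = 36. A DOR of 1 corresponds to a test that does not discriminate between cases and controls.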

SPLENDID incorporates continuous genetic ancestry in biobank-scale data to improve polygenic risk prediction across diverse populations

Improving genetic risk prediction across diverse population by disentangling ancestry representations

All of Us diversity and scale improve polygenic prediction contextually with greatest improvements for under-represented populations

An Ensemble Penalized Regression Method for Multi-ancestry Polygenic Risk Prediction

Incorporating functional priors improves polygenic prediction accuracy in UK Biobank and 23andMe data sets

Improving polygenic risk prediction in admixed populations by explicitly modeling ancestral-differential effects via GAUDI

Multi-ancestry polygenic risk scores using phylogenetic regularization

Improving polygenic prediction in ancestrally diverse populations

Leveraging genetic ancestry continuum information to interpolate PRS for admixed populations

A Non-Parametric Method for Building Predictive Genetic Tests on High-Dimensional Data

Fast and accurate Bayesian polygenic risk modeling with variational inference

LARGE-SCALE MULTIVARIATE SPARSE REGRESSION WITH APPLICATIONS TO UK BIOBANK

Leveraging functional genomic annotations and genome coverage to improve polygenic prediction of complex traits within and between ancestries

Variable prediction accuracy of polygenic scores within an ancestry group

Using Pre-training and Interaction Modeling for ancestry-specific disease prediction in UK Biobank

Optimization of Multi-Ancestry Polygenic Risk Score Disease Prediction Models

Improved polygenic prediction by Bayesian multiple regression on summary statistics

A Deep Learning-based Genome-wide Polygenic Risk Score for Common Diseases Identifies Individuals with Risk

Quantifying Portable Genetic Effects and Improving Cross-Ancestry Genetic Prediction with GWAS Summary Statistics

Improving polygenic prediction from summary data by learning patterns of effect sharing across multiple phenotypes