SPLENDID incorporates continuous genetic ancestry in biobank-scale data to improve polygenic risk prediction across diverse populations

Tony Chen,Haoyu Zhang,Rahul Mazumder,Xihong Lin
DOI: https://doi.org/10.1101/2024.10.14.618256
2024-10-17
Abstract:Polygenic risk scores are widely used in disease risk stratification, but their accuracy varies across diverse populations. Recent methods large-scale leverage multi-ancestry data to improve accuracy in under-represented populations but require labelling individuals by ancestry for prediction. This poses challenges for practical use, as clinical practices are typically not based on ancestry. We propose SPLENDID, a novel penalized regression framework for diverse biobank-scale data. Our method utilizes ancestry principal component interactions to model genetic ancestry as a continuum within a single prediction model for all ancestries, eliminating the need for discrete labels. In extensive simulations and analyses of 9 traits from the All of Us Research Program (N=224,364) and UK Biobank (N=340,140), SPLENDID significantly outperformed existing methods in prediction accuracy and model sparsity. By directly incorporating continuous genetic ancestry in model training, SPLENDID stands as a valuable tool for robust risk prediction across diverse populations and fairer clinical implementation.
Genetics
What problem does this paper attempt to address?