Abstract: We develop a "block" LASSO (blockLASSO) method for training polygenic scores (PGS) and demonstrate its use in All of Us (AoU) and the UK Biobank (UKB). BlockLASSO exploits the approximately block-diagonal structure of linkage disequilibrium (LD), which arises from the chromosomal partition of the genome. LASSO optimization is performed chromosome by chromosome, which reduces computational complexity by orders of magnitude. The resulting predictors for each chromosome are combined using simple re-weighting techniques. We demonstrate that blockLASSO is generally as effective for training PGS as (global) LASSO and other approaches. This is shown for 11 different phenotypes, in two different biobanks, and across 5 different ancestry groups (African, American, East Asian, European, and South Asian). The block approach works for a wide variety of phenotypes. Previous work has shown that phenotypes differ in how polygenic they are. Using sparse algorithms, an accurate PGS can be trained for type 1 diabetes (T1D) using 100 single nucleotide variants (SNVs). At the other extreme, a PGS for body mass index (BMI) needs more than 10k SNVs. BlockLASSO produces similar PGS for these phenotypes while training on just a fraction of the variants in each block. For example, within AoU (using only genetic information), the block PGS for T1D (1,500 cases/113,297 controls) reaches an AUC of 0.63 ± 0.02, and the block PGS for BMI (102,949 samples) reaches a correlation of 0.21 ± 0.01. By comparison, a traditional global LASSO approach yields an AUC of 0.65 ± 0.03 for T1D and a correlation of 0.19 ± 0.03 for BMI. Similar results are shown for a total of 11 phenotypes in both AoU and the UKB, across all 5 ancestry groups as defined via an Admixture analysis. In all cases, the contributions from common covariates (age, sex assigned at birth, and principal components) are removed before training. This new block approach is more computationally efficient and scalable than global machine learning approaches.
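The chromosome-by-chromosome procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a plain coordinate-descent LASSO solver, contiguous column ranges standing in for chromosomes, synthetic genotypes, and one possible "simple re-weighting" (the least-squares slope of the phenotype on each block's score). All function names and parameter values are hypothetical.

```python
import random

def center(values):
    """Subtract the sample mean."""
    m = sum(values) / len(values)
    return [v - m for v in values]

def soft_threshold(z, lam):
    """Soft-thresholding operator used in the LASSO coordinate update."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_cd(X, y, cols, lam, n_iter=100):
    """Coordinate descent for (1/2n)||y - X_c b||^2 + lam*||b||_1,
    restricted to the columns in `cols` (one block)."""
    n = len(y)
    beta = {j: 0.0 for j in cols}
    resid = list(y)  # residual y - X_c b, starting from b = 0
    col_sq = {j: sum(row[j] ** 2 for row in X) / n for j in cols}
    for _ in range(n_iter):
        for j in cols:
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the partial residual
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta

def block_lasso(X, y, blocks, lam):
    """Fit LASSO separately on each block, then rescale each block's
    weights by the least-squares slope of y on that block's score
    (an assumed re-weighting choice). Expects centered X columns and y."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for cols in blocks:
        b = lasso_cd(X, y, cols, lam)
        score = [sum(X[i][j] * b[j] for j in cols) for i in range(n)]
        ss = sum(s * s for s in score)
        w = sum(s * yi for s, yi in zip(score, y)) / ss if ss > 0 else 0.0
        for j in cols:
            beta[j] = w * b[j]
    return beta

def make_toy_data(seed=0, n=200, n_blocks=3, block_size=5):
    """Synthetic, independent genotypes (0/1/2, then centered) with one
    causal variant per block; purely illustrative, fixed to 3 blocks of 5."""
    rng = random.Random(seed)
    p = n_blocks * block_size
    cols = [center([(rng.random() < 0.5) + (rng.random() < 0.5) for _ in range(n)])
            for _ in range(p)]
    X = [[cols[j][i] for j in range(p)] for i in range(n)]
    true_beta = [0.0] * p
    true_beta[0], true_beta[6], true_beta[12] = 1.0, -0.8, 0.5
    y = center([sum(X[i][j] * true_beta[j] for j in range(p)) + rng.gauss(0, 0.5)
                for i in range(n)])
    blocks = [list(range(b * block_size, (b + 1) * block_size)) for b in range(n_blocks)]
    return X, y, blocks

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    a, b = center(a), center(b)
    num = sum(u * v for u, v in zip(a, b))
    return num / (sum(u * u for u in a) * sum(v * v for v in b)) ** 0.5

X, y, blocks = make_toy_data()
beta = block_lasso(X, y, blocks, lam=0.05)
pgs = [sum(x * b for x, b in zip(row, beta)) for row in X]
print("in-sample correlation:", round(pearson(pgs, y), 2))
```

Because each block is fit independently, only one block's columns need to be held in memory at a time, which is the source of the savings in this sketch. The re-weights here are fit on the training samples for brevity; fitting them on a separate validation set is another reasonable choice.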
Genetic matrices are typically stored as memory-mapped instances, but loading a million SNVs for a million participants can require 8 TB of memory. Running a LASSO algorithm requires holding at least two matrices of this size in memory. This requirement is so large that even large high-performance computing clusters cannot perform these calculations. To circumvent this issue, most current analyses use subsets: e.g., taking a representative sample of participants and filtering SNVs via pruning and thresholding. High-end LASSO training uses ~500 GB of memory (e.g., ~400k samples and ~50k SNVs) and takes 12-24 hours to complete. In contrast, the block approach typically uses ~200x (2 orders of magnitude) less memory and runs in ~500x less time.
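The memory figures above can be checked with simple arithmetic. The sketch below assumes a dense matrix stored at 8 bytes per entry (e.g., 64-bit floats); the text does not state the storage width, so this is an assumption, and the function name is hypothetical.

```python
def dense_matrix_bytes(n_samples, n_variants, bytes_per_entry=8):
    """Size in bytes of a dense samples-by-variants matrix.
    bytes_per_entry=8 is an assumed storage width (64-bit floats)."""
    return n_samples * n_variants * bytes_per_entry

TB = 10 ** 12
GB = 10 ** 9

# One million participants by one million SNVs
full = dense_matrix_bytes(10 ** 6, 10 ** 6)
print(full / TB, "TB for one full matrix")

# Two matrices of ~400k samples by ~50k SNVs
subset = 2 * dense_matrix_bytes(400_000, 50_000)
print(subset / GB, "GB for two subset matrices")
```

Under this assumption, the full matrix comes to 8 TB, and two matrices at the quoted subset size come to 320 GB, the same order of magnitude as the ~500 GB reported for high-end LASSO training.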

Efficient blockLASSO for Polygenic Scores with Applications to All of Us and UK Biobank
