How to improve polygenic prediction from whole-genome sequencing data by leveraging predicted epigenomic features?

Wanwen Zeng,Hanmin Guo,Qiao Liu,Wing H Wong

DOI: https://doi.org/10.1101/2024.10.04.24314860

2024-10-06

Abstract: Polygenic risk scores (PRS) are crucial in genetics for predicting individual susceptibility to complex diseases by aggregating the effects of numerous genetic variants. Whole-genome sequencing (WGS) has revolutionized our ability to detect rare and even de novo variants, creating an exciting opportunity for developing new PRS methods that can effectively leverage rare variants and capture the complex relationships among different variants. Furthermore, regulatory mechanisms play a crucial role in gene expression and disease manifestation, offering avenues to further enhance the performance and interpretation of PRS predictions. Through simulation studies, we highlighted aspects where current PRS methods face challenges when applied to WGS data, aiming to shed light on potential opportunities for further improvement. To address these challenges, we developed Epi-PRS, an approach that leverages the power of genomic large language models (LLMs) to impute epigenomic signals across diverse cellular contexts, for use as intermediate variables between genotype and phenotype. A pretrained LLM is employed to transform genotypes into epigenomic signals using personal diploid sequences as inputs, and the genetic risk is then estimated based on the imputed personal epigenomic signals. Epi-PRS enhances the assessment of personal variant impacts, enabling a comprehensive and holistic consideration of genotypic and regulatory information within large genomic regions. Our simulation results demonstrated that incorporating the nuanced effects of non-linear models, rare variants, and regulatory information can provide more precise PRS prediction and better understanding of genetic risk. Applying Epi-PRS to real data from the UK Biobank, our results further showed that Epi-PRS significantly outperforms existing PRS methods in two major diseases: breast cancer and diabetes. This study suggests that PRS methods can benefit from incorporating non-linear models, rare variants, and regulatory information, highlighting the potential for significant advancements in disease risk modeling and enhancing the understanding of precision medicine.

Health Informatics

What problem does this paper attempt to address?

The paper asks how polygenic risk scores (PRS) can be improved by using predicted epigenomic features in the context of whole-genome sequencing (WGS) data. Specifically, it points out that current PRS methods face challenges in three areas:

1. **Limitations of Linear Models**: Many existing PRS methods rely on linear models, which assume that the effects of genetic variants are additive. However, in larger datasets, complex interactions between variants may produce non-additive effects that linear models cannot adequately capture.
2. **Underutilization of Rare Variants**: Traditional PRS methods primarily focus on common variants, neglecting rare and de novo variants. Although rare variants have lower frequencies, their impact on disease risk can be significant.
3. **Role of Regulatory Mechanisms**: Gene expression and disease manifestation are significantly influenced by regulatory elements such as enhancers, promoters, and transcription factor binding sites. Existing PRS methods either ignore this regulatory information or incorporate it only as part of a linear model.

To address these issues, the researchers developed Epi-PRS, which uses a pretrained genomic large language model (LLM) to predict epigenomic signals in different cellular contexts from personal diploid sequences, and uses those signals as intermediate variables between genotype and phenotype. This approach allows for a better assessment of the impact of individual variants, thereby improving the accuracy of PRS predictions. The study reports that Epi-PRS significantly outperforms existing PRS methods for breast cancer and diabetes in UK Biobank data, demonstrating its potential in disease risk modeling.
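The contrast between an additive PRS and a two-stage, signal-mediated score can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_signal` is a hypothetical stand-in for the pretrained genomic model (it just computes GC fraction per window), and both the averaging of the two haplotypes and the logistic link are assumptions made here for illustration only.

```python
import numpy as np


def additive_prs(dosages, weights):
    """Classical additive PRS: weighted sum of allele dosages (0, 1, or 2)."""
    return np.asarray(dosages, dtype=float) @ np.asarray(weights, dtype=float)


def toy_signal(seq, n_windows=2):
    """Hypothetical stand-in for a sequence-to-epigenome model.

    Returns the GC fraction in each of n_windows equal-length windows.
    A real pipeline would call a pretrained genomic model here instead.
    """
    size = len(seq) // n_windows
    windows = [seq[i * size:(i + 1) * size] for i in range(n_windows)]
    return np.array([sum(base in "GC" for base in w) / len(w) for w in windows])


def two_stage_risk(hap1, hap2, predict_signal, risk_weights):
    """Two-stage sketch: diploid sequence -> predicted signals -> risk.

    Assumption: signals from the two haplotypes are averaged, and risk is
    a logistic function of a weighted sum of the averaged signals.
    """
    signals = 0.5 * (predict_signal(hap1) + predict_signal(hap2))
    logit = signals @ np.asarray(risk_weights, dtype=float)
    return 1.0 / (1.0 + np.exp(-logit))


# Additive score for two individuals genotyped at three variants.
scores = additive_prs([[0, 1, 2], [2, 0, 0]], [0.1, 0.2, 0.3])

# Signal-mediated score for one individual with two toy haplotypes.
risk = two_stage_risk("GGCCAATT", "AATTAATT", toy_signal, [1.0, -1.0])
```

In the additive score each variant contributes independently through its dosage, whereas in the two-stage sketch a variant affects risk only through its effect on the predicted signal of the surrounding window, which is how a sequence-based model can account for rare variants that lack a reliable per-variant weight.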
