How group structure impacts the numbers at risk for coronary artery disease: polygenic risk scores and non-genetic risk factors in the UK Biobank cohort

Jinbo Zhao, Adrian O'Hagan, Michael Salter-Townshend
DOI: https://doi.org/10.1093/genetics/iyae086
IF: 4.402
2024-05-24
Genetics
Abstract: The UK Biobank is a large cohort study that recruited over 500,000 British participants aged 40-69 in 2006-2010 at 22 assessment centres from across the UK. Self-reported health outcomes and hospital admission data are two types of records that include participants' disease status. Coronary artery disease (CAD) is the most common cause of death in the UK Biobank cohort. After distinguishing between prevalence and incidence CAD events for all UK Biobank participants, we identified geographical variations in age-standardised rates of CAD between assessment centres. Significant distributional differences were found between the pooled cohort equation scores of UK Biobank participants from England and Scotland using the Mann-Whitney test. Polygenic risk scores of UK Biobank participants from England and Scotland and from different assessment centres differed significantly using permutation tests. Our aim was to discriminate between assessment centres with different disease rates by collecting data on disease-related risk factors. However, relying solely on individual-level predictions and averaging them to obtain group-level predictions proved ineffective, particularly due to the presence of correlated covariates resulting from participation bias. By using the Mundlak model, which estimates a random effects regression by including the group means of the independent variables in the model, we effectively addressed these issues. In addition, we designed a simulation experiment to demonstrate the functionality of the Mundlak model. Our findings have applications in public health funding and strategy, as our approach can be used to predict case rates in the future, as both population structure and lifestyle changes are uncertain.
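For readers unfamiliar with the Mundlak model mentioned in the abstract, its general linear form can be written as below. This is a sketch of the textbook specification, not the exact model fitted in the paper: the notation (individual $i$ in group $j$, covariate vector $x_{ij}$, group mean $\bar{x}_j$, random group effect $u_j$) is assumed here, and a binary outcome such as CAD status would typically replace the left-hand side with a link function.

$$
y_{ij} = \beta_0 + x_{ij}^{\top}\beta + \bar{x}_j^{\top}\gamma + u_j + \varepsilon_{ij},
\qquad
\bar{x}_j = \frac{1}{n_j}\sum_{i=1}^{n_j} x_{ij}
$$

Here $\beta$ captures the within-group effect of the covariates, while $\gamma$ absorbs the part of the group-level variation that is correlated with the group means, which is what lets the random effect $u_j$ be treated as uncorrelated with the covariates.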
genetics & heredity
What problem does this paper attempt to address?
The paper aims to address the following issues:

1. **Risk Prediction of Coronary Artery Disease (CAD)**: By combining genetic risk factors (polygenic risk score, PRS) and non-genetic risk factors (such as the pooled cohort equations score, PCE), the study aims to improve the risk prediction of coronary artery disease at the individual level. The research found that adding PRS to the PCE model can correct the overestimation of risk and improve prediction accuracy.
2. **Geographical Differences and Regional Risk Prediction**: The study examined the differences in CAD prevalence between different regions of the UK (England and Scotland) and explored whether these differences are due to environmental or genetic factors. The results showed that although PRS cannot fully explain this difference, adjusting for the mean of within-group covariates can significantly improve the estimation of regional risk.
3. **Impact of Population Structure on Risk Prediction**: A method based on the Mundlak model was proposed to address the poor performance of traditional regression models in predicting population-level risk due to participation bias. By incorporating group-specific covariate means, the accuracy of population-level risk estimation was improved.
4. **Predicting Future CAD Incidence Rates**: The research methods can be used not only to understand the mechanisms of CAD occurrence but also to predict future CAD incidence rates based on changes in environmental factors and the genetic structure of the population. This has important implications for public health funding allocation and strategy formulation.

In summary, the paper aims to improve the regional-level risk prediction of coronary artery disease by comprehensively considering genetic and non-genetic factors and utilizing improved statistical models to analyze population structure.
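The role of the group means in the Mundlak approach can be illustrated with a small simulation. The sketch below is not the paper's simulation experiment or its data: it assumes a continuous outcome, a single covariate, 22 hypothetical groups (mirroring the number of assessment centres), and made-up effect sizes, and it uses plain least squares rather than a full random-effects fit. It only shows why a pooled regression is misled when the group effect correlates with the covariate, and how adding the group mean recovers the within-group slope.

```python
import numpy as np


def simulate(n_groups=22, n_per_group=500, beta=1.0, seed=0):
    """Simulate grouped data where the group effect is correlated with x.

    All parameter values are illustrative assumptions, not paper estimates.
    """
    rng = np.random.default_rng(seed)
    group = np.repeat(np.arange(n_groups), n_per_group)
    # Each group has its own covariate level (e.g. a centre-level risk profile).
    centre_mean = rng.normal(0.0, 1.0, n_groups)
    x = centre_mean[group] + rng.normal(0.0, 1.0, group.size)
    # The group effect depends on the centre mean, so it is correlated with x.
    u = 2.0 * centre_mean + rng.normal(0.0, 0.1, n_groups)
    y = beta * x + u[group] + rng.normal(0.0, 1.0, group.size)
    return group, x, y


def group_means(values, group):
    """Return, for every observation, the mean of `values` in its group."""
    sums = np.bincount(group, weights=values)
    counts = np.bincount(group)
    return (sums / counts)[group]


def ols(columns, y):
    """Least-squares fit with an intercept; returns [intercept, slopes...]."""
    X = np.column_stack([np.ones_like(y)] + list(columns))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


group, x, y = simulate()
x_bar = group_means(x, group)

pooled = ols([x], y)            # ignores group structure
mundlak = ols([x, x_bar], y)    # adds the group mean of x as a covariate

# Within-group (fixed-effects) slope, for comparison with the Mundlak slope.
x_within = x - x_bar
within_slope = np.sum(x_within * y) / np.sum(x_within ** 2)

print(f"pooled slope on x:   {pooled[1]:.3f}")
print(f"Mundlak slope on x:  {mundlak[1]:.3f}")
print(f"Mundlak slope on x̄: {mundlak[2]:.3f}")
print(f"within-group slope:  {within_slope:.3f}")
```

In this setup the pooled slope mixes the individual-level effect with the between-group association, whereas the slope on `x` in the Mundlak fit coincides with the within-group estimator and the coefficient on the group mean picks up the group-level component.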