Machine Learning Approaches to Predict Alcohol Consumption from Biomarkers in the UK Biobank

Mohammed F. Hassan, Amanda Elswick Gentry, Elizabeth C. Prom-Wormley, Roseann E. Peterson, Bradley T. Webb
DOI: https://doi.org/10.1101/2024.12.22.24319486
2024-12-24
Abstract:
Background: Measuring and estimating alcohol consumption (AC) is important for individual health, public health, and societal benefits. While self-report and diagnostic interviews are commonly used, incorporating biological-based indices can offer a complementary approach.
Methods: We evaluate machine learning (ML) based predictions of AC using blood and urine-derived biomarkers. This research has been conducted using the UK Biobank (UKB) Resource. In addition to the number of alcoholic Drinks Per Week (DPW), four other related phenotypes were predicted for performance comparison. Five ML models were assessed: LASSO, Ridge regression, Gradient Boosting Machines (GBM), Model Boosting (MBOOST), and Extreme Gradient Boosting (XGBOOST).
Results: All five ML methods achieved moderate prediction of DPW (r²=0.304-0.356), with biomarkers significantly increasing prediction above that obtained using only known covariates and liver enzymes (r²=0.105). XGBOOST achieved the best prediction performance (r²=0.356, MAE=5.214) at the expense of greater model complexity and training resources compared to the other ML methods. All ML models were able to accurately predict whether subjects were heavy drinkers (DPW>8 for women and DPW>15 for men) and produced explainable models that highlighted the role of biomarkers in predicting DPW. While phenotype correlations were similar across methods, XGBOOST produced similar heritability estimates for observed (h²=0.064) and predicted (h²=0.077) DPW. The estimated genetic correlation between observed and predicted DPW was 0.877.
Conclusions: ML-based prediction of AC from biological measures provides an opportunity to identify individuals at increased risk of heavy AC, thereby offering a complementary avenue for risk assessment beyond self-report, screening instruments, or structured interviews, which have some known biases. In addition, explainable AI tools identified a constellation of biomarkers associated with AC.
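As a rough illustration of the quantities quoted in the abstract, the sketch below applies the sex-specific heavy-drinking cutoffs (DPW>8 for women, DPW>15 for men) and computes an MAE and a squared Pearson correlation on toy numbers. It is not the authors' pipeline. The function names, the "female"/"male" labels, and the toy values are hypothetical. The abstract does not say whether its r-squared figure is a squared correlation or a coefficient of determination, so the squared correlation used here is an assumption.

```python
# Thresholds as quoted in the abstract: DPW > 8 (women), DPW > 15 (men).
# The "female"/"male" keys are hypothetical labels chosen for this sketch.
HEAVY_DPW_THRESHOLD = {"female": 8, "male": 15}


def is_heavy_drinker(dpw, sex):
    """Return True if drinks per week strictly exceed the sex-specific cutoff."""
    return dpw > HEAVY_DPW_THRESHOLD[sex]


def mean_absolute_error(observed, predicted):
    """Average absolute difference between observed and predicted DPW."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def squared_pearson_r(observed, predicted):
    """Squared Pearson correlation (assumed here to be the reported metric)."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_o * var_p)


# Toy data for illustration only; these are not UK Biobank values.
observed_dpw = [0, 4, 10, 20]
predicted_dpw = [2, 4, 8, 18]

mae = mean_absolute_error(observed_dpw, predicted_dpw)
r2 = squared_pearson_r(observed_dpw, predicted_dpw)
heavy_flags = [is_heavy_drinker(d, "female") for d in predicted_dpw]
```

Because the cutoffs differ by sex, the same predicted DPW can fall on either side of the heavy-drinking boundary: a predicted value of 9 would be flagged for a woman but not for a man.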