Integrative machine learning approaches for predicting disease risk using multi-omics data from the UK Biobank

Oscar Aguilar, Cheng Chang, Elsa Bismuth, Manuel A. Rivas
DOI: https://doi.org/10.1101/2024.04.16.589819
2024-04-20
Abstract: We train prediction and survival models using multi-omics data for disease risk identification and stratification. Existing work on disease prediction focuses on risk analysis using datasets of individual data types (metabolomic, genomic, demographic), while our study creates an integrated model for disease risk assessment. We compare machine learning models such as Lasso regression, Multi-Layer Perceptron, XGBoost, and AdaBoost to analyze multi-omics data, incorporating ROC-AUC score comparisons for various diseases and feature combinations. Additionally, we train Cox proportional hazards models for each disease to perform survival analysis. Although the integration of multi-omics data significantly improves risk prediction for 8 diseases, we find that the contribution of metabolomic data is marginal when compared to standard demographic, genetic, and biomarker features. Nonetheless, we see that metabolomics is a useful replacement for the standard biomarker panel when that panel is not readily available.
Genomics
What problem does this paper attempt to address?
This paper addresses how to predict disease risk by integrating multiple types of omics data (phenomics, genomics, metabolomics, and clinical biomarkers) in machine learning models. Its main objective is to build a comprehensive model that improves disease risk assessment over models that rely on a single data type.

The authors compare Lasso regression, multilayer perceptron, XGBoost, and AdaBoost across multiple diseases. They find that XGBoost trains fastest, AdaBoost yields the sparsest feature selection, and Lasso regression gives the best classification performance. The study also includes survival analysis, using Cox proportional hazards models to evaluate disease risk over time.

The results show that adding genomic and biomarker data to the baseline phenotypic model significantly improves predictive performance for certain diseases. Metabolomics data, by contrast, adds little beyond standard biomarkers in most cases, although it can serve as an alternative when standard biomarker data is not available. In short, the paper shows empirically how much each data type contributes to disease prediction, both on its own and in combination with the others.
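
The ROC-AUC comparison across feature combinations can be illustrated with a minimal sketch. This is not the authors' pipeline: the labels, the risk scores, and the feature-set names below are made-up toy values, and `roc_auc` is a hypothetical helper that computes AUC directly as the probability that a randomly chosen case receives a higher score than a randomly chosen control.

```python
from typing import Dict, List, Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC as P(score of a case > score of a control); ties count as 0.5."""
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Toy data (assumption): 1 = developed the disease, 0 = did not.
labels: List[int] = [1, 1, 1, 0, 0, 0, 0, 0]

# Hypothetical predicted risks from models trained on two feature sets.
scores_by_feature_set: Dict[str, List[float]] = {
    "demographics": [0.6, 0.4, 0.3, 0.5, 0.2, 0.35, 0.1, 0.45],
    "demographics+biomarkers": [0.8, 0.6, 0.4, 0.5, 0.2, 0.3, 0.1, 0.35],
}

auc_by_feature_set = {
    name: roc_auc(labels, scores)
    for name, scores in scores_by_feature_set.items()
}

for name, auc in auc_by_feature_set.items():
    print(f"{name}: ROC-AUC = {auc:.3f}")
```

In a real analysis the scores would come from a model evaluated on held-out individuals, and the pairwise loop would normally be replaced by a library routine, since it scales with the product of the numbers of cases and controls.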