Abstract: Objective: This study leverages the diverse dataset of the All of Us Research Program (All of Us) to develop a predictive model for cardiovascular disease (CVD) in breast cancer (BC) survivors. Central to this effort is a robust data integration pipeline that synthesizes electronic health records (EHRs), patient surveys, and genomic data while upholding fairness across demographic variables. Materials and methods: We developed a universal data wrangling pipeline that processes and merges the heterogeneous data sources of the All of Us dataset, addresses missingness and variance in the data, and aligns disparate data modalities into a coherent framework for analysis. Using a composite feature set comprising EHR, lifestyle, and social determinants of health (SDoH) data, we then employed Adaptive Lasso and Random Forest regression models to predict 6 CVD outcomes. The models were evaluated using the concordance index (c-index) and the time-dependent area under the receiver operating characteristic curve (AUC) over a 10-year period. Results: The Adaptive Lasso model showed consistent performance across most CVD outcomes, while the Random Forest model performed particularly well in predicting outcomes such as transient ischemic attack when incorporating the full multi-modality feature set. Feature importance analysis revealed age and previous coronary events as dominant predictors across CVD outcomes, with SDoH clustering labels highlighting the nuanced impact of social factors. Discussion: The development of both a Cox-based predictive model and a Random Forest regression model illustrates how All of Us can be used to integrate EHRs and patient surveys in support of precision medicine. The inclusion of SDoH clustering labels revealed the significant impact of sociobehavioral factors on patient outcomes, emphasizing the importance of comprehensive health determinants in predictive models.
Despite these advancements, limitations include the exclusion of genetic data, the broad categorization of CVD conditions, and the need for fairness analyses to ensure equitable model performance across diverse populations. Future work should refine clinical and social variable measurements, incorporate advanced imputation techniques, and explore additional predictive algorithms to enhance model precision and fairness. Conclusion: This study demonstrates the viability of the diverse All of Us dataset for developing a multi-modality predictive model for CVD risk stratification in BC survivors. The data integration pipeline and the resulting predictive models establish a methodological foundation for future research into personalized healthcare.
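
The abstract reports model performance with the c-index. As a rough illustration of what that metric measures, the sketch below computes Harrell's concordance index for right-censored data in plain Python. It is not the authors' evaluation code: the function name `concordance_index`, its arguments, and the toy values are hypothetical, and the tie handling (0.5 credit for tied risk scores) is one common convention assumed here.

```python
from typing import Sequence


def concordance_index(
    times: Sequence[float],
    events: Sequence[int],
    risk_scores: Sequence[float],
) -> float:
    """Harrell's C-index (illustrative sketch, not the study's implementation).

    times       -- follow-up time per subject
    events      -- 1 if the outcome was observed, 0 if censored
    risk_scores -- model output; higher means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        # A pair is comparable only if the subject with the shorter
        # follow-up time actually experienced the event.
        if not events[i]:
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0   # earlier event had the higher risk
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5   # assumed convention for tied scores
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (made up for illustration): four subjects, the third is censored.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
toy_risk = [0.9, 0.7, 0.8, 0.1]
toy_c = concordance_index(toy_times, toy_events, toy_risk)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking of subjects by risk. In practice an established survival-analysis library would usually be preferred over a hand-written quadratic loop like this one.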

Integrative machine learning approaches for predicting disease risk using multi-omics data from the UK Biobank

Disease prediction with multi-omics and biomarkers empowers case-control genetic discoveries in the UK Biobank

Integrating Multi-Organ Imaging-Derived Phenotypes and Genomic Information for Predicting the Occurrence of Common Diseases

Multi-modality risk prediction of cardiovascular diseases for breast cancer cohort in the All of Us Research Program

Interpretable Machine Learning Leverages Proteomics to Improve Cardiovascular Disease Risk Prediction and Biomarker Identification

Machine learning for comprehensive interaction modelling improves disease risk prediction in the UK Biobank

On the combination of omics data for prediction of binary outcomes

Prediction of disease-free survival for precision medicine using cooperative learning on multi-omic data

Using Pre-training and Interaction Modeling for ancestry-specific disease prediction in UK Biobank

Multi-omic prediction of incident type 2 diabetes

Interpretable meta-learning of multi-omics data for survival analysis and pathway enrichment

Does combining numerous data types in multi-omics data improve or hinder performance in survival prediction? Insights from a large-scale benchmark study

Machine Learning Strategies for Improved Phenotype Prediction in Underrepresented Populations

A Machine Learning Model for Disease Risk Prediction by Integrating Genetic and Non-Genetic Factors

Predicting Deep Learning Based Multi-Omics Parallel Integration Survival Subtypes in Lung Cancer Using Reverse Phase Protein Array Data

Integrating Somatic Mutations for Breast Cancer Survival Prediction Using Machine Learning Methods

Integration of multi-omics data for survival prediction of lung adenocarcinoma

Multimodal AI/ML for discovering novel biomarkers and predicting disease using multi-omics profiles of patients with cardiovascular diseases

A comprehensive multi-task deep learning approach for predicting metabolic syndrome with genetic, nutritional, and clinical data

Machine Learning Prediction of Biomarkers from SNPs and of Disease Risk from Biomarkers in the UK Biobank

Exploring machine learning strategies for predicting cardiovascular disease risk factors from multi-omic data