Abstract: Objectives: The relative significance of predictive factors for cancer and coronary heart disease (CHD) is still unclear. This study aims to identify and evaluate the risk factors contributing to the development of both conditions using the CatBoost machine learning algorithm. Methods: Data from twelve datasets of the 2009 to 2010 National Health and Nutrition Examination Survey (NHANES), incorporating both survey responses and laboratory results, were used. Separate CatBoost models were developed to predict cancer and CHD occurrence using Shapley Additive Explanations (SHAP), Recursive Feature Elimination with Cross-Validation (RFECV), and adjusted class weights. Model performance was assessed using Receiver Operating Characteristic (ROC) curves. Results: The datasets were combined to form a cohort of 5,012 participants, each with 24 selected features. The cancer prediction model achieved a ROC Area Under the Curve (AUC) of 0.76, with 13 selected features, yielding an accuracy of 0.70, sensitivity of 0.67, and specificity of 0.70. In contrast, the CHD prediction model achieved a higher ROC AUC of 0.87, with an accuracy of 0.83, sensitivity of 0.78, and specificity of 0.83. The top predictive features for each disease were ranked and selected by the CatBoost algorithm. Conclusions: This study identifies key demographic and laboratory features significantly associated with cancer and CHD risk in the NHANES dataset. The findings suggest that these factors could be valuable for estimating individual risk and could inform machine learning models aimed at early detection and screening.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper aims to identify and evaluate the risk factors for cancer and coronary heart disease (CHD) using machine learning, specifically the CatBoost algorithm. The research objectives are:

1. **Identifying risk factors**: Determine which demographic characteristics, behavioral habits, and biological indicators are related to the occurrence of cancer and coronary heart disease.
2. **Evaluating prediction performance**: Use the CatBoost model to predict the occurrence of cancer and coronary heart disease and evaluate the model's accuracy, sensitivity, and specificity.
3. **Feature selection**: Select the most important predictive features using Recursive Feature Elimination with Cross-Validation (RFECV) and Shapley Additive Explanations (SHAP).
4. **Comparing risk factors between diseases**: Explore whether the risk factors for cancer and coronary heart disease overlap or differ.

### Research background

Cancer and coronary heart disease are major global health problems, and both are especially prevalent in the elderly population. Although many predictive factors for the two diseases have been identified, the interaction mechanisms between them and the specific pathways of comorbidity are still not fully understood. In addition, the predisposing factors specific to patients who have both diseases have not been fully studied.

### Methods

1. **Data sources**: The study used 12 datasets from the 2009-2010 National Health and Nutrition Examination Survey (NHANES), including questionnaire responses and laboratory test results.
2. **Data processing**: The datasets were combined into a cohort of 5,012 participants, each with 24 selected features.
3. **Model development**: Separate CatBoost models were developed to predict cancer and coronary heart disease, with class weights adjusted and feature selection performed using RFECV.
4. **Performance evaluation**: Model performance was evaluated using the area under the receiver operating characteristic (ROC) curve (AUC).

### Results

1. **Model performance**:
   - The cancer prediction model had an AUC of 0.76, an accuracy of 0.70, a sensitivity of 0.67, and a specificity of 0.70.
   - The coronary heart disease prediction model had an AUC of 0.87, an accuracy of 0.83, a sensitivity of 0.78, and a specificity of 0.83.
2. **Important features**:
   - **Cancer model**: Age, gender, financial status (income from interest, dividends, or rent), neutrophil-to-lymphocyte ratio (NLR), and glycated hemoglobin (HbA1c) level were the most important factors.
   - **Coronary heart disease model**: Age, gender, platelet count, family history of coronary heart disease, and red blood cell distribution width (RDW) were the most important factors.

### Conclusions

This study identified key demographic and laboratory characteristics significantly associated with the risk of cancer and coronary heart disease in the NHANES dataset. These findings can help estimate individual risk and may inform early detection and screening. However, the precision of the models is low, so positive predictions need to be confirmed by further clinical evaluation and diagnostic tests. Future research could improve predictive ability by integrating genomic data.
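### Illustrative sketch of the reported metrics

The accuracy, sensitivity, specificity, and ROC AUC figures in the Results, and the remark about low precision in the Conclusions, can be made concrete with a small worked example. The following is a minimal sketch in plain Python and not the authors' code: the labels and scores are invented toy values (not NHANES data), the 0.5 decision threshold is an assumption, and `balanced_class_weights` uses one common "balanced" convention that may differ from the class weights actually used in the study.

```python
from __future__ import annotations


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels, where 1 marks the disease class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def summary_metrics(y_true, y_pred):
    """Accuracy, sensitivity (recall), specificity, and precision from hard predictions."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }


def roc_auc(y_true, scores):
    """AUC as the chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def balanced_class_weights(y_true):
    """Assumed convention: weight = n_samples / (n_classes * class_count)."""
    n = len(y_true)
    counts = {c: sum(1 for t in y_true if t == c) for c in set(y_true)}
    return {c: n / (len(counts) * k) for c, k in counts.items()}


# Toy, imbalanced example: 2 positives and 8 negatives (hypothetical values).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.8, 0.3, 0.2, 0.1, 0.6, 0.05, 0.15, 0.25]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(summary_metrics(y_true, y_pred))
print("ROC AUC:", roc_auc(y_true, scores))
print("class weights:", balanced_class_weights(y_true))
```

On this toy data the specificity is 0.75 and the AUC is 0.875, yet the precision is only 1/3, because the two false positives outnumber the single true positive. This is the general pattern behind low precision on imbalanced outcomes: even a modest false-positive rate applied to a large healthy majority can swamp the true positives.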

Risk Factor Analysis for Cancer and Coronary Heart Disease: A Machine Learning Approach Using National Health and Nutrition Examination Survey Data

Improving Cardiovascular Risk Prediction Through Machine Learning Modelling of Irregularly Repeated Electronic Health Records

Novel Machine Learning Algorithm in Risk Prediction Model for Pan-Cancer Risk: Application in a Large Prospective Cohort

Identifying Cancer Patients at Risk for Heart Failure Using Machine Learning Methods

Using machine learning-based algorithms to construct cardiovascular risk prediction models for Taiwanese adults based on traditional and novel risk factors

Use machine learning models to identify and assess risk factors for coronary artery disease

Application of machine learning algorithms to construct and validate a prediction model for coronary heart disease risk in patients with periodontitis: a population-based study

Two-level boosting classifiers ensemble based on feature selection for heart disease prediction

Abstract P369: Using Machine Learning to Predict Unplanned Readmission Due to Cardiovascular Disease Among Hospitalized Patients With Cancer

Construction and Validation of a Predictive Model for Coronary Artery Disease Using Extreme Gradient Boosting

Using Machine Learning Techniques to Identify Key Risk Factors for Diabetes and Undiagnosed Diabetes

Early prediction model for coronary heart disease using genetic algorithms, hyper-parameter optimization and machine learning techniques

Nonlaboratory-based risk assessment model for coronary heart disease screening: Model development and validation

A Machine Learning Model Based on Genetic and Traditional Cardiovascular Risk Factors to Predict Premature Coronary Artery Disease

Machine learning to predict hemodynamically significant CAD based on traditional risk factors, coronary artery calcium and epicardial fat volume

Predicting coronary heart disease in Chinese diabetics using machine learning

Disseminating the Risk Factors With Enhancement in Precision Medicine Using Comparative Machine Learning Models for Healthcare Data

Machine Learning and Real-World Data to Predict Lung Cancer Risk in Routine Care

Using a machine learning-based risk prediction model to analyze the coronary artery calcification score and predict coronary heart disease and risk assessment

Prediction of Breast Cancer using Machine Learning Approaches

A proposed technique for predicting heart disease using machine learning algorithms and an explainable AI method