Risk Factor Analysis for Cancer and Coronary Heart Disease: A Machine Learning Approach Using National Health and Nutrition Examination Survey Data

Hakan Bozcuk
DOI: https://doi.org/10.1101/2024.11.05.24316754
2024-11-05
Abstract: Objectives: The relative significance of predictive factors for cancer and coronary heart disease (CHD) is still unclear. This study aims to identify and evaluate the risk factors contributing to the development of both conditions using the CatBoost machine learning algorithm. Methods: Data from twelve datasets of the 2009 to 2010 National Health and Nutrition Examination Survey (NHANES), incorporating both survey responses and laboratory results, were used. Separate CatBoost models were developed to predict cancer and CHD occurrence, using Shapley Additive Explanations (SHAP), Recursive Feature Elimination with Cross-Validation (RFECV), and adjusted class weights; model performance was assessed using Receiver Operating Characteristic (ROC) curves. Results: The datasets were combined to form a cohort of 5,012 participants, each with 24 selected features. The cancer prediction model achieved a ROC Area Under the Curve (AUC) of 0.76, with 13 selected features, yielding an accuracy of 0.70, sensitivity of 0.67, and specificity of 0.70. In contrast, the CHD prediction model achieved a higher ROC AUC of 0.87, with an accuracy of 0.83, sensitivity of 0.78, and specificity of 0.83. The top predictive features for each disease were ranked and selected by the CatBoost algorithm. Conclusions: This study identifies key demographic and laboratory features significantly associated with cancer and CHD risk in the NHANES dataset. The findings suggest that these factors could be valuable for estimating individual risk and could inform machine learning models aimed at early detection and screening.
Health Informatics
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper aims to identify and evaluate the risk factors for cancer and coronary heart disease (CHD) using machine-learning methods, specifically the CatBoost algorithm. The research objectives are:

1. **Identifying risk factors**: Determine which demographic characteristics, behavioral habits, and biological indicators are related to the occurrence of cancer and coronary heart disease.
2. **Evaluating prediction performance**: Use CatBoost models to predict the occurrence of cancer and coronary heart disease, and evaluate their performance in terms of accuracy, sensitivity, and specificity.
3. **Feature selection**: Select the most important predictive features using Recursive Feature Elimination with Cross-Validation (RFE-CV) and Shapley Additive Explanations (SHAP).
4. **Comparing risk factors between diseases**: Explore whether the risk factors for cancer and coronary heart disease overlap or differ.

### Research background

Cancer and coronary heart disease are major global health problems, and both are especially prevalent in the elderly population. Although many predictive factors for these two diseases have been identified, the interaction mechanisms between them and the specific pathways of comorbidity are still not fully understood. In addition, the predisposing factors specific to patients who have both diseases have not been fully studied.

### Methods

1. **Data sources**: The study used 12 datasets from the 2009-2010 National Health and Nutrition Examination Survey (NHANES), including questionnaire responses and laboratory test results.
2. **Data processing**: These datasets were combined into a cohort of 5,012 participants, each with 24 selected features.
3. **Model development**: Separate CatBoost models were developed to predict cancer and coronary heart disease, with class weights adjusted and feature selection performed using RFE-CV.
4. **Performance evaluation**: The area under the receiver operating characteristic (ROC) curve (AUC) was used to evaluate model performance.

### Results

1. **Model performance**:
   - The cancer prediction model had an AUC of 0.76, an accuracy of 0.70, a sensitivity of 0.67, and a specificity of 0.70.
   - The coronary heart disease prediction model had an AUC of 0.87, an accuracy of 0.83, a sensitivity of 0.78, and a specificity of 0.83.
2. **Important features**:
   - **Cancer model**: Age, gender, financial status (income from interest, dividends, or rent), neutrophil-to-lymphocyte ratio (NLR), and glycated hemoglobin (HbA1c) level were the most important factors.
   - **Coronary heart disease model**: Age, gender, platelet count, family history of coronary heart disease, and red blood cell distribution width (RDW) were the most important factors.

### Conclusions

This study identified key demographic and laboratory characteristics significantly associated with the risk of cancer and coronary heart disease in the NHANES dataset. These findings may help in estimating individual risk and may provide valuable information for early detection and screening. However, the precision of the models is low, so positive predictions need to be verified by further clinical evaluation and diagnostic tests. Future research could improve the predictive ability of the models by integrating genomic data.
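
The metrics reported in the Results, and the remark in the Conclusions that precision is low, can be illustrated with a short Python sketch. The labels and predictions below are synthetic toy values chosen for illustration, not NHANES data or the paper's outputs, and the helper names (`confusion_counts`, `summary_metrics`, `roc_auc`) are hypothetical rather than taken from the study. The 10% prevalence used here is also an assumption made only to show the effect of class imbalance.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for binary labels, where 1 means disease present."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def summary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, sensitivity, specificity and precision from hard predictions.

    Assumes both classes occur and at least one positive prediction is made.
    """
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "precision": tp / (tp + fp),     # positive predictive value
    }


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the chance that a random positive outscores a random negative.

    Ties count as half a win. Assumes both classes are present.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Synthetic toy cohort: 10 cases and 90 non-cases (assumed 10% prevalence).
toy_true = [1] * 10 + [0] * 90
# 7 of 10 cases flagged, 27 of 90 non-cases flagged.
toy_pred = [1] * 7 + [0] * 3 + [1] * 27 + [0] * 63
toy_metrics = summary_metrics(toy_true, toy_pred)
for name, value in toy_metrics.items():
    print(f"{name}: {value:.2f}")

# Tiny ranking example for AUC, again with made-up scores.
toy_auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
print(f"auc: {toy_auc:.2f}")
```

In this toy cohort, sensitivity, specificity, and accuracy are all 0.70, yet precision is only about 0.21, because false positives drawn from the much larger group of non-cases outnumber the true positives. This is one general way a model can show reasonable sensitivity and specificity while still needing confirmatory testing for its positive predictions; whether it explains the paper's own precision depends on the actual class balance, which the summary does not state.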