Predicting Chronic Obstructive Pulmonary Disease (COPD) Diagnosis Using Primary Care Variables and Machine Learning Algorithms

Dheiver Francisco Santos
DOI: https://doi.org/10.1101/2024.11.10.24317053
2024-11-10
Abstract: Chronic Obstructive Pulmonary Disease (COPD) affects many adults over the age of 50. Part of its incidence in the population is attributed to work and exposure to risk factors such as silica dust, and anticipating the diagnosis can prevent its worsening. This study aims to identify patients at higher risk of having a positive COPD diagnosis using variables routinely collected in primary care. A total of 120,294 participants from the UK Biobank database, recruited between 2006 and 2010, were analyzed. Of these, 1,837 (1.5%) had a positive COPD diagnosis. A total of 20 variables, including demographic data, laboratory tests, habits, and symptoms, were selected to build predictive models of COPD using five machine learning algorithms (artificial neural networks, extra trees, random forests, CatBoost, and extreme gradient boosting). Additionally, a subset of 7,628 participants with a history in the construction and mining industries was selected to train a specialized model. Among them, 248 (3.25%) had a positive diagnosis. Data were randomly divided, with 70% allocated for training the models and 30% for performance testing. Both models showed good predictive performance. The general model achieved an AUC of 0.847, sensitivity of 0.786, and specificity of 0.765. In the specialist model, an AUC of 0.830, sensitivity of 0.773, and specificity of 0.773 were obtained. The five main predictive variables were chronic cough, age, history of asthma, sputum production, and tobacco exposure. The results demonstrate that it is possible to predict the individual risk of COPD diagnosis using variables commonly collected in primary care.
Health Informatics
What problem does this paper attempt to address?
This paper aims to predict the diagnosis of chronic obstructive pulmonary disease (COPD) using machine learning algorithms and variables routinely collected in primary care. Specifically, the study aims to identify patients at higher risk of having COPD so that they can be diagnosed and treated earlier, preventing the disease from worsening. It also considers occupational risk factors, such as silica dust exposure, by training a separate model on participants with a history in the construction and mining industries.

### Research Background and Objectives

COPD affects many adults over the age of 50. Part of its incidence in the population is attributed to occupational exposure to risk factors such as silica dust. Anticipating the diagnosis can prevent the disease from worsening. The objective of this study is therefore to use variables routinely collected in primary care to identify patients at higher risk of a positive COPD diagnosis.

### Methods

1. **Data Sources**:
   - The study used data from the UK Biobank, covering 120,294 participants recruited between 2006 and 2010, of whom 1,837 (1.5%) had a positive COPD diagnosis.
   - From these participants, 7,628 with a history in the construction and mining industries were selected to train a specialized model; 248 of them (3.25%) had a positive COPD diagnosis.
2. **Variable Selection**:
   - Twenty variables, including demographic data, laboratory test results, habits, and symptoms, were selected to build the COPD prediction models.
3. **Machine Learning Algorithms**:
   - Five machine learning algorithms were used: artificial neural networks (ANN), extra trees, random forests, CatBoost, and extreme gradient boosting (XGBoost).
4. **Model Training and Evaluation**:
   - The data were randomly divided into a 70% training set and a 30% test set.
   - AUC (area under the curve), sensitivity, and specificity were used as evaluation metrics.

### Results

- **General Model**:
  - AUC: 0.847
  - Sensitivity: 0.786
  - Specificity: 0.765
- **Specialized Model**:
  - AUC: 0.830
  - Sensitivity: 0.773
  - Specificity: 0.773

### Main Predictor Variables

- Chronic cough
- Age
- History of asthma
- Sputum production
- Tobacco exposure

### Discussion

The results show that machine learning algorithms applied to variables routinely collected in primary care can predict the individual risk of a COPD diagnosis with good performance. Chronic cough, age, history of asthma, sputum production, and tobacco exposure were the most important predictors. The specialized model for participants with a history in the construction and mining industries performed comparably to the general model: its AUC (0.830 versus 0.847) and sensitivity (0.773 versus 0.786) were slightly lower, and its specificity (0.773 versus 0.765) was slightly higher.

### Conclusion

The individual risk of a COPD diagnosis can be predicted from variables commonly collected in primary care. This approach can support earlier identification of patients at higher risk, including those in high-risk occupational groups such as construction and mining.
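
The evaluation procedure described in the Methods (a random 70/30 split, scored with AUC, sensitivity, and specificity) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the toy labels and scores, and the 0.5 decision threshold are hypothetical, and only the Python standard library is used rather than any particular machine learning package. The last lines recompute the positive-diagnosis rates from the counts reported in the abstract.

```python
import random


def split_70_30(rows, seed=0):
    """Shuffle rows reproducibly and split them into 70% train / 30% test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, scores):
    """Probability that a random positive case outscores a random negative
    case, counting ties as one half (the rank-based definition of AUC)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = positive COPD diagnosis, 0 = negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.3, 0.8, 0.4, 0.2, 0.1, 0.05]  # made-up model outputs
y_pred = [1 if s >= 0.5 else 0 for s in scores]      # assumed 0.5 threshold

train, test = split_70_30(range(10))
sens, spec = sensitivity_specificity(y_true, y_pred)
toy_auc = auc(y_true, scores)

# Positive-diagnosis rates recomputed from the counts in the abstract.
general_rate = 100 * 1837 / 120294    # about 1.5%
specialist_rate = 100 * 248 / 7628    # about 3.25%
```

Because sensitivity and specificity depend on the chosen threshold while AUC does not, the reported sensitivity and specificity values correspond to one particular operating point that the summary does not specify.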