Disparate Model Performance and Stability in Machine Learning Clinical Support for Diabetes and Heart Diseases

Ioannis Bilionis, Ricardo C. Berrios, Luis Fernandez-Luque, Carlos Castillo
2024-12-27
Abstract: Machine Learning (ML) algorithms are vital for supporting clinical decision-making in biomedical informatics. However, their predictive performance can vary across demographic groups, often due to the underrepresentation of historically marginalized populations in training datasets. The investigation reveals widespread sex- and age-related inequities in chronic disease datasets and their derived ML models. Thus, a novel analytical framework is introduced, combining systematic arbitrariness with traditional metrics like accuracy and data complexity. The analysis of data from over 25,000 individuals with chronic diseases revealed mild sex-related disparities, favoring predictive accuracy for males, and significant age-related differences, with better accuracy for younger patients. Notably, older patients showed inconsistent predictive accuracy across seven datasets, linked to higher data complexity and lower model performance. This highlights that representativeness in training data alone does not guarantee equitable outcomes, and model arbitrariness must be addressed before deploying models in clinical settings.
Machine Learning, Artificial Intelligence
### What problems does this paper attempt to solve?

This paper examines disparities in the performance and stability of machine learning (ML) models used for clinical support in diabetes and heart disease. The research focuses on the following points:

1. **Imbalance in prediction performance**: The predictive performance of ML models can differ across demographic groups (such as sex and age). Historically marginalized groups are often underrepresented in training datasets, which can lower prediction accuracy for those groups.
2. **Data complexity and systematic arbitrariness**: The study introduces a new analytical framework that combines systematic arbitrariness with traditional performance metrics (such as accuracy) and data complexity measures to evaluate model stability and fairness. Analyzing data from more than 25,000 individuals with chronic diseases, it found that sex-related differences were relatively minor, while age-related differences were more significant; in particular, prediction accuracy for older patients was unstable.
3. **Model fairness**: The study emphasizes that demographic balance in the training data does not by itself guarantee fair model output. Systematic biases can remain in the predictions, especially for older patients, and they need to be addressed before deployment to ensure fairness and reliability in clinical use.
4. **Effect of additional data**: The study also explores whether model performance can be improved by adding data for specific groups, and estimates how many additional data points would be needed to balance performance across groups.
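The group-wise comparison behind point 1 can be illustrated with a minimal sketch. It computes the AUC separately for each demographic group on one validation fold and reports the gap. The function names (`auc`, `auc_by_group`), the group labels, and the toy numbers are hypothetical and are not taken from the paper; the authors' actual pipeline may differ.

```python
from collections import defaultdict


def auc(labels, scores):
    """Rank-based AUC: share of (positive, negative) pairs in which the
    positive record gets the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auc_by_group(labels, scores, groups):
    """Return {group: AUC}, computed only on that group's records."""
    buckets = defaultdict(lambda: ([], []))
    for y, s, g in zip(labels, scores, groups):
        buckets[g][0].append(y)
        buckets[g][1].append(s)
    return {g: auc(ys, ss) for g, (ys, ss) in buckets.items()}


# Toy, made-up validation fold (illustrative only, not data from the paper).
labels = [1, 0, 1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.8, 0.3, 0.6, 0.7, 0.4, 0.5]
groups = ["younger"] * 4 + ["older"] * 4

per_group = auc_by_group(labels, scores, groups)
gap = per_group["younger"] - per_group["older"]
print(per_group, gap)
```

Repeating this over many folds and resamples gives a distribution of gaps, which is one way to count how often one group's AUC exceeds the other's.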
### Research methods

- **Datasets**: Multiple publicly available chronic disease datasets were used, including two diabetes-related datasets (D1, D2) and five heart-disease-related datasets (D3-D7). Some of the datasets were split into smaller subsets for analysis.
- **Model training**: Three gradient-boosting algorithms (XGBoost, LGBoost, HGBoost) were trained with three-fold cross-validation and repeated sampling to generate multiple models for evaluation.
- **Performance evaluation**: The area under the receiver operating characteristic curve (AUC) was the main performance metric, and learning curve analysis was used to assess the impact of additional data on performance.
- **Data complexity analysis**: Multiple data complexity metrics were calculated to help explain differences in classification accuracy between groups.
- **Systematic arbitrariness analysis**: A self-consistency measure was introduced to evaluate how stable the models' predictions are within each group.

### Main findings

- **Sex differences**: In about 10% of the validation results, the AUC for male patients was higher than for female patients, while in only 1% of the results the AUC for female patients was higher.
- **Age differences**: In 32% of the validation results, the AUC for younger patients was higher than for older patients, while in 5% of cases the AUC for older patients was higher.
- **Relationship between data complexity and performance**: In some datasets, higher data complexity was associated with lower AUC values, especially for older patients.
- **Systematic arbitrariness**: Older patients showed significant arbitrariness in 4 datasets; that is, different models gave inconsistent predictions for the same patient.

### Conclusions

The research shows that representative training data alone is not sufficient to ensure that ML models are fair across populations.
For older patients in particular, higher data complexity and lower model performance make predictions more unstable and arbitrary. Model arbitrariness must therefore be addressed before ML models are applied in the clinic, to ensure their fairness and reliability.

### Significance

This research offers insights for improving the use of ML models in medicine, particularly for ensuring model fairness and reducing bias. Future research should explore how technical means and human intervention can reduce systematic arbitrariness and improve model performance across populations.
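To make the notion of arbitrariness concrete, the sketch below estimates self-consistency for a single patient as the share of model pairs that agree on the predicted label, then averages it per group. The summary does not state the exact formula the authors used, so the pairwise-agreement definition, the function names, and the toy predictions are all assumptions for illustration.

```python
def self_consistency(predictions):
    """Share of distinct model pairs that agree on one patient's label.

    predictions: list of 0/1 labels, one per trained model.
    With B models, B0 predicting 0 and B1 predicting 1, the number of
    disagreeing pairs is B0*B1 out of B*(B-1)/2 pairs in total.
    """
    b = len(predictions)
    if b < 2:
        raise ValueError("need predictions from at least two models")
    b1 = sum(predictions)
    b0 = b - b1
    return 1.0 - (2.0 * b0 * b1) / (b * (b - 1))


def mean_self_consistency_by_group(pred_rows, groups):
    """Average per-patient self-consistency within each group.

    pred_rows: one list of model predictions per patient.
    groups: one group label per patient, in the same order.
    """
    totals, counts = {}, {}
    for row, g in zip(pred_rows, groups):
        totals[g] = totals.get(g, 0.0) + self_consistency(row)
        counts[g] = counts.get(g, 0) + 1
    return {g: totals[g] / counts[g] for g in totals}


# Toy, made-up predictions from four models (assumption, not paper data).
pred_rows = [
    [1, 1, 1, 1],  # younger patient, all models agree
    [0, 0, 0, 0],  # younger patient, all models agree
    [1, 0, 1, 0],  # older patient, models split evenly
    [1, 1, 1, 0],  # older patient, one model dissents
]
groups = ["younger", "younger", "older", "older"]

sc = mean_self_consistency_by_group(pred_rows, groups)
print(sc)
```

A value of 1.0 means every model gives the patient the same label; lower values mean the prediction depends more on which resample the model happened to be trained on.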