Abstract: Machine Learning (ML) algorithms are vital for supporting clinical decision-making in biomedical informatics. However, their predictive performance can vary across demographic groups, often due to the underrepresentation of historically marginalized populations in training datasets. The investigation reveals widespread sex- and age-related inequities in chronic disease datasets and their derived ML models. Thus, a novel analytical framework is introduced, combining systematic arbitrariness with traditional metrics like accuracy and data complexity. The analysis of data from over 25,000 individuals with chronic diseases revealed mild sex-related disparities, favoring predictive accuracy for males, and significant age-related differences, with better accuracy for younger patients. Notably, older patients showed inconsistent predictive accuracy across seven datasets, linked to higher data complexity and lower model performance. This highlights that representativeness in training data alone does not guarantee equitable outcomes, and model arbitrariness must be addressed before deploying models in clinical settings.

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper examines differences in the performance and stability of machine learning (ML) models used for clinical support in diabetes and heart disease. Specifically, the research focuses on the following points:

1. **Imbalance in prediction performance**: The prediction performance of ML models differs across demographic groups (such as sex and age). In particular, historically marginalized groups are underrepresented in the training datasets, which can lower prediction accuracy for these groups.
2. **Data complexity and systematic arbitrariness**: The research introduces a new analytical framework that combines traditional performance metrics (such as accuracy) with data complexity measures and a measure of arbitrariness to evaluate the stability and fairness of the models. Analyzing data from more than 25,000 chronic disease patients, the study found that sex-related differences were relatively minor, while age-related differences were more pronounced; in particular, prediction accuracy for elderly patients was unstable.
3. **Model fairness**: The research emphasizes that demographic balance in the training data does not by itself guarantee fair model outputs. Systematic biases may remain in the predictions, especially for elderly patients, and need to be addressed before deployment to ensure fairness and reliability in clinical applications.
4. **Effect of additional data**: The research also explores whether adding data for specific groups improves model performance, and estimates the number of additional data points required to equalize performance across groups.
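The per-group comparison described in point 1 can be illustrated with a small sketch. It computes the area under the ROC curve (AUC) separately for each demographic group, using the pairwise definition of AUC (the probability that a randomly chosen positive case is scored above a randomly chosen negative one). The function names, group labels, and toy data are hypothetical and are not taken from the paper; this is only one plausible way to set up such a comparison.

```python
from collections import defaultdict


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def auc_by_group(labels, scores, groups):
    """Return {group: AUC} computed only on the cases belonging to that group."""
    by_group = defaultdict(lambda: ([], []))
    for y, s, g in zip(labels, scores, groups):
        by_group[g][0].append(y)
        by_group[g][1].append(s)
    return {g: auc(ys, ss) for g, (ys, ss) in by_group.items()}


# Toy data (hypothetical): true labels, model scores, and an age-group tag.
labels = [1, 0, 1, 0, 1, 0, 1, 0]
scores = [0.9, 0.1, 0.8, 0.2, 0.6, 0.7, 0.4, 0.3]
groups = ["younger"] * 4 + ["older"] * 4

per_group = auc_by_group(labels, scores, groups)
gap = per_group["younger"] - per_group["older"]
print(per_group, gap)
```

In practice the gap would be computed on held-out validation folds and repeated across resampled models, so that a difference between groups can be distinguished from sampling noise.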
### Research methods

- **Datasets**: Multiple publicly available chronic disease datasets were used, including two diabetes-related datasets (D1, D2) and five heart-disease-related datasets (D3-D7). Some of the datasets were split into smaller subsets for analysis.
- **Model training**: Three gradient-boosting algorithms (XGBoost, LGBoost, HGBoost) were used, with three-fold cross-validation and repeated sampling to generate multiple models for evaluation.
- **Performance evaluation**: The area under the receiver operating characteristic curve (AUC) was the main performance metric, and learning curve analysis was used to assess the impact of additional data on performance.
- **Data complexity analysis**: Multiple complexity metrics were calculated to help explain differences in classification accuracy among groups.
- **Systematic arbitrariness analysis**: A self-consistency measure was used to evaluate the stability of model predictions within each group.

### Main findings

- **Sex differences**: In about 10% of the validation results, the AUC for male patients was higher than for female patients, while in only 1% the AUC for female patients was higher.
- **Age differences**: In 32% of the validation results, the AUC for young patients was higher than for elderly patients, while in 5% of cases the AUC for elderly patients was higher.
- **Relationship between data complexity and performance**: In some datasets, higher data complexity was associated with lower AUC values, especially for elderly patients.
- **Systematic arbitrariness**: Elderly patients showed significant arbitrariness in 4 datasets; that is, different models gave inconsistent predictions for the same patient.

### Conclusions

The research shows that representativeness of the training data alone is not sufficient to ensure that ML models are fair across populations. For elderly patients in particular, higher data complexity and lower model performance coincide with more unstable and arbitrary predictions. Model arbitrariness therefore needs to be addressed before ML models are applied in the clinic, to ensure their fairness and reliability.

### Significance

This research provides insights for improving the application of ML models in medicine, especially with respect to ensuring model fairness and reducing bias. Future research should further explore how to reduce systematic arbitrariness through technical means and human intervention, and how to improve model performance across populations.
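The idea of arbitrariness summarized above, where different models disagree about the same patient, can be sketched as follows. The summary does not give the paper's exact formula, so this example assumes one common definition of self-consistency for binary predictions: the probability that two distinct models, drawn from the B trained models, assign the same label to a patient, which equals 1 - 2·B0·B1 / (B·(B-1)), where B0 and B1 are the numbers of models predicting 0 and 1. All names and data below are hypothetical.

```python
from collections import defaultdict


def self_consistency(predictions):
    """Probability that two distinct models agree on one patient's label.

    `predictions` holds the 0/1 labels that B different models assign
    to the same patient. Assumed definition, not confirmed by the paper.
    """
    b = len(predictions)
    if b < 2:
        raise ValueError("need predictions from at least two models")
    b1 = sum(predictions)  # number of models predicting the positive class
    b0 = b - b1            # number of models predicting the negative class
    return 1.0 - (2.0 * b0 * b1) / (b * (b - 1))


def mean_self_consistency_by_group(predictions_per_patient, groups):
    """Average the per-patient self-consistency within each demographic group."""
    values = defaultdict(list)
    for preds, g in zip(predictions_per_patient, groups):
        values[g].append(self_consistency(preds))
    return {g: sum(v) / len(v) for g, v in values.items()}


# Toy data (hypothetical): four models' labels for each of four patients.
predictions_per_patient = [
    [1, 1, 1, 1],  # all models agree
    [0, 0, 0, 0],  # all models agree
    [1, 1, 0, 0],  # models split evenly
    [1, 0, 1, 0],  # models split evenly
]
groups = ["younger", "younger", "older", "older"]

by_group = mean_self_consistency_by_group(predictions_per_patient, groups)
print(by_group)
```

A value of 1 means every model agrees on the patient, while lower values mean the prediction depends on which model happened to be trained. Comparing group averages, as in the toy example, is one way to make the elderly-versus-young contrast concrete.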

Disparate Model Performance and Stability in Machine Learning Clinical Support for Diabetes and Heart Diseases

Sex-Based Performance Disparities in Machine Learning Algorithms for Cardiac Disease Prediction: Exploratory Study

Fairness gaps in Machine learning models for hospitalization and emergency department visit risk prediction in home healthcare patients with heart failure

Equity in Healthcare: Analyzing Disparities in Machine Learning Predictions of Diabetic Patient Readmissions

Assessing fairness in machine learning models: A study of racial bias using matched counterparts in mortality prediction for patients with chronic diseases

Assessing Social Determinants-Related Performance Bias of Machine Learning Models: A case of Hyperchloremia Prediction in ICU Population

An Empirical Characterization of Fair Machine Learning For Clinical Risk Prediction

Addressing Class Imbalance in Healthcare Data: Machine Learning Solutions for Age-Related Macular Degeneration and Preeclampsia

The Sociodemographic Biases in Machine Learning Algorithms: A Biomedical Informatics Perspective

Race, Sex, and Age Disparities in the Performance of ECG Deep Learning Models Predicting Heart Failure

Disseminating the Risk Factors With Enhancement in Precision Medicine Using Comparative Machine Learning Models for Healthcare Data

Potential Biases in Machine Learning Algorithms Using Electronic Health Record Data

Conceptualizing bias in EHR data: A case study in performance disparities by demographic subgroups for a pediatric obesity incidence classifier

Fair Machine Learning for Healthcare Requires Recognizing the Intersectionality of Sociodemographic Factors, a Case Study

Evaluating gender bias in ML-based clinical risk prediction models: A study on multiple use cases at different hospitals

Algorithmic encoding of protected characteristics in image-based models for disease detection

Intersectional consequences for marginal fairness in prediction models of emergency admissions

Comparison of Machine Learning Classification Algorithms and Application to the Framingham Heart Study

Adapting Machine Learning Diagnostic Models to New Populations Using a Small Amount of Data: Results from Clinical Neuroscience

A machine learning model for predicting, diagnosing, and mitigating health disparities in hospital readmission

Fairness in Machine Learning Meets with Equity in Healthcare