Machine learning models' assessment: trust and performance

Sousa, S., Paredes, S., Rocha, T., Sousa, J.
DOI: https://doi.org/10.1007/s11517-024-03145-5
2024-06-09
Medical & Biological Engineering & Computing
Abstract: The common black box nature of machine learning models is an obstacle to their application in the health care context. Their widespread application is limited by a significant "lack of trust." The main goal of this work is therefore the development of an evaluation approach that can assess trust and performance simultaneously. Trust assessment is based on (i) model robustness (stability assessment), (ii) confidence (95% CI of the geometric mean), and (iii) interpretability (comparison of the respective feature ranking with clinical evidence). Performance is assessed through the geometric mean. For validation, the approach was applied to patient stratification in cardiovascular risk assessment using a Portuguese dataset (N = 1544). Five different models were compared: (i) GRACE score, the most common risk assessment tool in Portugal for patients with acute coronary syndrome; (ii) logistic regression; (iii) Naïve Bayes; (iv) decision trees; and (v) a rule-based approach previously developed by this team. The obtained results confirm that the simultaneous assessment of trust and performance can be successfully implemented. The rule-based approach seems to have potential for clinical application. It provides a high level of trust in its operation while outperforming the GRACE model, which enhances the physicians' acceptance that clinical use requires. This may increase the possibility of effectively aiding the clinical decision.
engineering, biomedical; computer science, interdisciplinary applications; mathematical & computational biology; medical informatics
What problem does this paper attempt to address?
The main objective of this paper is to develop an evaluation method that can simultaneously assess the trustworthiness and performance of machine learning models. Specifically:

1. **Trustworthiness Evaluation**:
   - Model robustness (stability assessment).
   - Confidence (95% confidence interval of the geometric mean).
   - Interpretability (comparison of the feature ranking with clinical evidence).
2. **Performance Evaluation**:
   - Performance is assessed by calculating the geometric mean (Gmean).

To validate this evaluation method, a Portuguese cardiovascular dataset (N = 1544) was used, and five different models were compared:

- GRACE score: the most commonly used risk assessment tool in Portugal for patients with acute coronary syndrome.
- Logistic regression.
- Naïve Bayes.
- Decision tree.
- Rule-based method (previously developed by the research team).

The results indicate that it is feasible to assess trustworthiness and performance simultaneously, and that the rule-based method has potential for clinical application. It provides a high level of trust in its operation and also outperforms the GRACE model, which enhances physicians' acceptance of the model and may help it support clinical decisions.
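
To make the performance and confidence criteria concrete, the sketch below computes a geometric mean and a 95% confidence interval for a binary classifier. It rests on two assumptions that the abstract does not confirm: that Gmean is the square root of sensitivity times specificity (its usual definition for imbalanced classification), and that the interval is obtained with a percentile bootstrap. The function names `gmean` and `bootstrap_ci` and the toy labels are hypothetical, chosen only for illustration.

```python
import math
import random


def gmean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity (assumed definition).

    Labels are binary: 1 = event, 0 = no event.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)


def bootstrap_ci(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for Gmean (assumed method, not the paper's)."""
    if len(set(y_true)) < 2:
        raise ValueError("both classes must be present in y_true")
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    while len(scores) < n_resamples:
        # Resample patients with replacement, keeping label/prediction pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        t = [y_true[i] for i in idx]
        p = [y_pred[i] for i in idx]
        if len(set(t)) < 2:
            continue  # skip resamples that lost one of the classes
        scores.append(gmean(t, p))
    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Toy data (hypothetical, not from the paper's dataset).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
point_estimate = gmean(y_true, y_pred)
ci_low, ci_high = bootstrap_ci(y_true, y_pred)
```

Under these assumptions, a narrow interval would indicate a more dependable performance estimate. Gmean is a natural fit for risk stratification because it penalises a model that achieves high specificity only by missing most events.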