Assessing the Reliability of Machine Learning Models Applied to the Mental Health Domain Using Explainable AI

Vishnu Pendyala, Hyungkyun Kim
DOI: https://doi.org/10.3390/electronics13061025
IF: 2.9
2024-03-09
Electronics
Abstract: Machine learning is increasingly and ubiquitously being used in the medical domain. Evaluation metrics like accuracy, precision, and recall may indicate the performance of the models but not necessarily the reliability of their outcomes. This paper assesses the effectiveness of a number of machine learning algorithms applied to an important dataset in the medical domain, specifically, mental health, by employing explainability methodologies. Using multiple machine learning algorithms and model explainability techniques, this work provides insights into the models' workings to help determine the reliability of the machine learning algorithm predictions. The results are not intuitive. It was found that the models were focusing significantly on less relevant features and, at times, unsound ranking of the features to make the predictions. This paper therefore argues that it is important for research in applied machine learning to provide insights into the explainability of models in addition to other performance metrics like accuracy. This is particularly important for applications in critical domains such as healthcare.
engineering, electrical & electronic; computer science, information systems; physics, applied
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper primarily explores the reliability and interpretability of machine learning models in the field of mental health. Specifically:

1. **Effectiveness of Evaluation Metrics**: The study assesses whether traditional evaluation metrics (such as accuracy, precision, and recall) are sufficient to measure the performance of machine learning models in mental health prediction.
2. **Application of Explainable AI Techniques**: The paper employs two popular explainable AI techniques, Local Interpretable Model-agnostic Explanations (LIME) and SHapley Additive exPlanations (SHAP), to complement traditional evaluation metrics and better understand the models' decision-making process.
3. **Comparison of Different Algorithms**: Experiments are conducted on datasets using various machine learning algorithms (such as logistic regression, K-nearest neighbors, and decision trees), and LIME and SHAP are used to analyze the interpretability of these models.

The core research questions are:

- **RQ1**: How reliable are evaluation metrics like accuracy in assessing the performance of machine learning models?
- **RQ2**: How do explainable AI techniques like LIME and SHAP complement traditional evaluation metrics?
- **RQ3**: What are the differences in result interpretability among various machine learning algorithms?

The paper demonstrates through experiments that relying solely on traditional evaluation metrics can lead to misleading conclusions, especially in mental health prediction. For instance, some models may focus heavily on irrelevant or unreasonable features, which undermines the reliability of their predictions. The paper therefore emphasizes that when applying machine learning in critical fields such as healthcare, it is essential to consider not only performance metrics but also the interpretability of the models.
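To make the idea behind RQ2 concrete, the following is a minimal sketch of how a Shapley-style attribution can expose a model that leans on an implausible feature. It is not the paper's code and does not use the SHAP library: it computes exact Shapley values by enumerating feature subsets, which is only practical for a handful of features. The feature names, the toy model, and its weights are all hypothetical and chosen purely for illustration.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, instance, baseline):
    """Exact Shapley values for one prediction.

    Features outside a coalition are replaced by their baseline value.
    Enumerates every subset, so cost grows exponentially with feature count.
    """
    n = len(instance)
    features = list(range(n))

    def value(subset):
        # Model output when only the features in `subset` take the instance's values.
        mixed = [instance[i] if i in subset else baseline[i] for i in features]
        return predict(mixed)

    phis = []
    for i in features:
        others = [j for j in features if j != i]
        phi = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi += weight * (value(s | {i}) - value(s))
        phis.append(phi)
    return phis

# Hypothetical feature names; the last one should carry no clinical meaning.
FEATURE_NAMES = ["family_history", "work_interference", "respondent_id_bucket"]

def toy_model(x):
    # Hypothetical scoring function that (wrongly) weights the ID bucket most.
    return 0.2 * x[0] + 0.3 * x[1] + 1.5 * x[2]

instance = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]

phis = shapley_values(toy_model, instance, baseline)
ranking = sorted(zip(FEATURE_NAMES, phis), key=lambda p: abs(p[1]), reverse=True)

for name, phi in ranking:
    print(f"{name}: {phi:+.3f}")
```

In this sketch the attribution ranking places the meaningless `respondent_id_bucket` feature first, which is the kind of signal an accuracy score alone would not reveal. The attributions also sum to the difference between the prediction for the instance and the prediction for the baseline, a property of Shapley values that serves as a quick sanity check.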