A Framework for Evaluating Appropriateness, Trustworthiness, and Safety in Mental Wellness AI Chatbots

Lucia Chen, David A. Preece, Pilleriin Sikka, James J. Gross, Ben Krause
2024-07-16
Abstract: Large language model (LLM) chatbots are susceptible to biases and hallucinations, but current evaluations of mental wellness technologies lack comprehensive case studies to evaluate their practical applications. Here, we address this gap by introducing the MHealth-EVAL framework, a new role-play based interactive evaluation method designed specifically for evaluating the appropriateness, trustworthiness, and safety of mental wellness chatbots. We also introduce Psyfy, a new chatbot leveraging LLMs to facilitate transdiagnostic Cognitive Behavioral Therapy (CBT). We demonstrate the MHealth-EVAL framework's utility through a comparative study of two versions of Psyfy against standard baseline chatbots. Our results showed that Psyfy chatbots outperformed the baseline chatbots in delivering appropriate responses, engaging users, and avoiding untrustworthy responses. However, both Psyfy and the baseline chatbots exhibited some limitations, such as providing predominantly US-centric resources. While Psyfy chatbots were able to identify most unsafe situations and avoid giving unsafe responses, they sometimes struggled to recognize subtle harmful intentions when prompted in role play scenarios. Our study demonstrates a practical application of the MHealth-EVAL framework and showcases Psyfy's utility in harnessing LLMs to enhance user engagement and provide flexible and appropriate responses aligned with an evidence-based CBT approach.
Human-Computer Interaction
What problem does this paper attempt to address?
The paper addresses a gap in how mental wellness chatbots are evaluated: current evaluations lack comprehensive case studies, so they cannot adequately assess the appropriateness, trustworthiness, and safety of these chatbots in practical use. Specifically, the paper focuses on three aspects:

1. **Appropriateness**: whether the chatbot's responses fit the specific situation and effectively guide users toward self-reflection and emotion regulation.
2. **Trustworthiness**: whether the information the chatbot provides is accurate and reliable, especially mental health education content and resource recommendations.
3. **Safety**: whether the chatbot avoids giving harmful advice in high-risk situations (such as suicidal ideation or domestic violence) and guides users to seek professional help.

To address these issues, the paper introduces the MHealth-EVAL framework, a role-play based interactive evaluation method designed specifically to evaluate these three aspects of mental wellness chatbots. It also introduces Psyfy, a new chatbot that uses large language models (LLMs) to facilitate transdiagnostic Cognitive Behavioral Therapy (CBT). By comparing two versions of Psyfy against standard baseline chatbots, the paper demonstrates the utility of the MHealth-EVAL framework. The results show that the Psyfy chatbots outperformed the baseline chatbots in delivering appropriate responses, engaging users, and avoiding untrustworthy responses, but sometimes struggled to recognize subtle harmful intentions.