Building Trust in Mental Health Chatbots: Safety Metrics and LLM-Based Evaluation Tools

Jung In Park, Mahyar Abbasian, Iman Azimi, Dawn Bounds, Angela Jun, Jaesu Han, Robert McCarron, Jessica Borelli, Jia Li, Mona Mahmoudi, Carmen Wiedenhoeft, Amir Rahmani
2024-08-04
Abstract: Objective: This study aims to develop and validate an evaluation framework to ensure the safety and reliability of mental health chatbots, which are increasingly popular due to their accessibility, human-like interactions, and context-aware support. Materials and Methods: We created an evaluation framework with 100 benchmark questions and ideal responses, and five guideline questions for chatbot responses. This framework, validated by mental health experts, was tested on a GPT-3.5-turbo-based chatbot. Automated evaluation methods explored included large language model (LLM)-based scoring, an agentic approach using real-time data, and embedding models to compare chatbot responses against ground truth standards. Results: The results highlight the importance of guidelines and ground truth for improving LLM evaluation accuracy. The agentic method, dynamically accessing reliable information, demonstrated the best alignment with human assessments. Adherence to a standardized, expert-validated framework significantly enhanced chatbot response safety and reliability. Discussion: Our findings emphasize the need for comprehensive, expert-tailored safety evaluation metrics for mental health chatbots. While LLMs have significant potential, careful implementation is necessary to mitigate risks. The superior performance of the agentic approach underscores the importance of real-time data access in enhancing chatbot reliability. Conclusion: The study validated an evaluation framework for mental health chatbots, proving its effectiveness in improving safety and reliability. Future work should extend evaluations to accuracy, bias, empathy, and privacy to ensure holistic assessment and responsible integration into healthcare. Standardized evaluations will build trust among users and professionals, facilitating broader adoption and improved mental health support through technology.
Computation and Language, Artificial Intelligence, Human-Computer Interaction, Machine Learning
What problem does this paper attempt to address?
The paper addresses the safety assessment of mental health chatbots. Its goal is to develop and validate an evaluation framework that ensures the safety and reliability of such chatbots. Because these chatbots are increasingly popular for their ease of use, human-like interaction, and context-aware support, they need to meet high safety standards when providing services.
To this end, the researchers designed an evaluation framework comprising 100 benchmark questions with ideal answers and five guideline questions for assessing the chatbot's responses. The benchmark questions cover various clinical scenarios, including but not limited to mental health crises, anxiety attacks, and depressive symptoms. The study also explored three automated evaluation methods: scoring with large language models (LLMs), an agentic approach, and similarity testing with embedding models. The experiments found that the agentic method and the embedding-model method were the most accurate, closely matching the evaluations of human experts. The agentic method in particular came closest to human assessments by accessing real-time data and reliable information sources. The findings underscore the importance of comprehensive, expert-tailored safety evaluation metrics for mental health chatbots.
In summary, the paper's main contribution is a standardized evaluation framework for assessing the safety and reliability of mental health chatbots, which supports the responsible application of the technology and strengthens the trust of users and professionals. Future research can extend the evaluation to accuracy, bias, empathy, and privacy to ensure a comprehensive assessment and promote broader adoption of mental health support technologies.
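As a rough illustration of the embedding-based similarity test mentioned above, the following Python sketch compares a chatbot response with an ideal (ground truth) response using cosine similarity. It is a minimal sketch under stated assumptions, not the authors' implementation: the summary does not name the embedding model used, so `toy_embed` is a hypothetical hashed bag-of-words stand-in, and the function names and example sentences are invented for illustration.

```python
import hashlib
import math
import re


def toy_embed(text, dim=256):
    """Hypothetical stand-in for a real embedding model (assumption):
    maps each word to a bucket via a stable hash and counts occurrences."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z']+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % dim] += 1.0
    return vec


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_to_ideal(response, ideal, embed=toy_embed):
    """Score a chatbot response by its embedding similarity to the ideal answer."""
    return cosine_similarity(embed(response), embed(ideal))


if __name__ == "__main__":
    # Illustrative texts only; these are not taken from the paper's benchmark.
    ideal = "Please contact a crisis line right away"
    candidates = [
        "Please contact a crisis line now",
        "Bananas are yellow fruit",
    ]
    for candidate in candidates:
        print(f"{similarity_to_ideal(candidate, ideal):.3f}  {candidate}")
```

In practice, the `embed` argument would be replaced by a call to an actual sentence-embedding model. A similarity score of this kind captures only closeness to the reference answer, so it complements expert review and guideline-based scoring and does not replace them.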