Response Generation for Cognitive Behavioral Therapy with Large Language Models: Comparative Study with Socratic Questioning

Kenta Izumi, Hiroki Tanaka, Kazuhiro Shidara, Hiroyoshi Adachi, Daisuke Kanayama, Takashi Kudo, Satoshi Nakamura
2024-01-29
Abstract: Dialogue systems controlled by predefined or rule-based scenarios derived from counseling techniques, such as cognitive behavioral therapy (CBT), play an important role in mental health apps. Despite the need for responsible responses, it is conceivable that using the newly emerging LLMs to generate contextually relevant utterances will enhance these apps. In this study, we construct dialogue modules based on a CBT scenario focused on conventional Socratic questioning using two kinds of LLMs: a Transformer-based dialogue model further trained with a social media empathetic counseling dataset, provided by Osaka Prefecture (OsakaED), and GPT-4, a state-of-the-art LLM created by OpenAI. By comparing systems that use LLM-generated responses with those that do not, we investigate the impact of generated responses on subjective evaluations such as mood change, cognitive change, and dialogue quality (e.g., empathy). As a result, no notable improvements are observed when using the OsakaED model. When using GPT-4, the amount of mood change, empathy, and other dialogue qualities improve significantly. Results suggest that GPT-4 possesses a high counseling ability. However, they also indicate that even when using a dialogue model trained with a human counseling dataset, it does not necessarily yield better outcomes compared to scenario-based dialogues. While presenting LLM-generated responses, including GPT-4, and having them interact directly with users in real-life mental health care services may raise ethical issues, it is still possible for human professionals to produce example responses or response templates using LLMs in advance in systems that use rules, scenarios, or example responses.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper asks whether using large language models (LLMs) to generate dialogue responses in cognitive behavioral therapy (CBT) improves a dialogue system's effectiveness and user experience. Specifically, the researchers constructed dialogue modules based on a CBT scenario using two LLMs: a Transformer-based dialogue model further trained on a social media empathetic counseling dataset (OsakaED), and GPT-4, a state-of-the-art LLM created by OpenAI. By comparing systems that use LLM-generated responses with those that do not, the researchers investigated the impact of generated responses on subjective evaluations, including mood change, cognitive change, and dialogue quality (such as empathy).

### Main Questions:

1. **Can responses generated by LLMs improve the effectiveness of CBT dialogue systems?**
   - The researchers experimentally evaluated different dialogue systems in terms of users' mood change, cognitive change, and dialogue quality.
2. **How do different LLMs (OsakaED and GPT-4) perform in CBT dialogues?**
   - The researchers compared responses generated by OsakaED and GPT-4, particularly in terms of mood change, empathy, and dialogue quality.
3. **Can combining Socratic questioning with LLM-generated responses further enhance the system's effectiveness?**
   - The researchers examined whether combining Socratic questioning with LLM-generated responses significantly improves the user experience.

### Experimental Design:

- **Dialogue Scenarios**: The researchers created a CBT dialogue scenario containing 15 system responses, including Socratic questioning.
- **Dialogue Systems**: The researchers implemented five dialogue systems:
  - **SQ**: Using only Socratic questioning.
  - **OsakaED**: Using responses generated by OsakaED.
  - **OsakaED+SQ**: Combining responses generated by OsakaED with Socratic questioning.
  - **GPT-4**: Using responses generated by GPT-4.
  - **GPT-4+SQ**: Combining responses generated by GPT-4 with Socratic questioning.

### Results:

- **GPT-4**: GPT-4 performed best on most evaluation metrics, with significant improvements in mood change, empathy, and other dialogue qualities.
- **OsakaED**: No notable improvement was observed when using OsakaED-generated responses, and its overall effectiveness was lower than GPT-4's.
- **Adding Socratic Questioning**: Adding Socratic questioning on top of OsakaED and GPT-4 did not significantly improve the user experience, possibly because the LLMs themselves were already capable of generating sufficiently rich and engaging responses.

### Conclusion:

- Responses generated by GPT-4 can significantly improve the effectiveness and user experience of CBT dialogue systems, particularly in terms of mood change and empathy.
- A dialogue model trained on a human counseling dataset (OsakaED) does not necessarily yield better outcomes than scenario-based dialogue.
- Simply adding Socratic questioning does not necessarily improve the user experience significantly, suggesting that the LLMs themselves can already generate high-quality responses.
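
The five system conditions listed under Experimental Design can be pictured as a small configuration table plus a function that assembles one system turn. The following is only an illustrative sketch, not the authors' implementation: `SystemConfig`, `CONDITIONS`, `build_turn`, and the `generate` callback are hypothetical names, and joining a generated response with a scripted question by simple concatenation is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class SystemConfig:
    """One experimental condition (hypothetical structure for illustration)."""
    name: str
    generator: Optional[str]  # "OsakaED", "GPT-4", or None for no generated response
    use_socratic: bool        # whether the scripted Socratic question is shown


# The five conditions named in the summary.
CONDITIONS = [
    SystemConfig("SQ", None, True),
    SystemConfig("OsakaED", "OsakaED", False),
    SystemConfig("OsakaED+SQ", "OsakaED", True),
    SystemConfig("GPT-4", "GPT-4", False),
    SystemConfig("GPT-4+SQ", "GPT-4", True),
]


def build_turn(
    config: SystemConfig,
    user_utterance: str,
    scripted_question: str,
    generate: Callable[[str, str], str],
) -> str:
    """Assemble one system turn for the given condition.

    `generate(model_name, user_utterance)` stands in for a call to a
    dialogue model. Concatenating the generated response and the scripted
    question is an assumption, not a detail confirmed by the summary.
    """
    parts = []
    if config.generator is not None:
        parts.append(generate(config.generator, user_utterance))
    if config.use_socratic:
        parts.append(scripted_question)
    return " ".join(parts)
```

Passing a stub for `generate` makes it possible to check how each condition composes its turn without calling any model, which is also how a human professional could review templates in advance.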