Comparing the Efficacy of GPT-4 and Chat-GPT in Mental Health Care: A Blind Assessment of Large Language Models for Psychological Support

Birger Moell
2024-05-15
Abstract: Background: Rapid advancements in natural language processing have led to the development of large language models with the potential to revolutionize mental health care. These models have shown promise in assisting clinicians and providing support to individuals experiencing various psychological challenges.
Computation and Language, Artificial Intelligence, Human-Computer Interaction
What problem does this paper attempt to address?
This paper evaluates how two large language models, GPT-4 and Chat-GPT, differ in performance in mental health care, in order to determine their potential for providing psychological support. Specifically, it assesses their effectiveness and safety in clinical psychology applications by comparing the two models' responses to a set of 18 psychological problems.

### Research Background

Rapid progress in natural language processing has produced large language models (LLMs) that show great potential for assisting clinicians and supporting individuals experiencing various psychological challenges. If these models can be applied effectively, they could revolutionize mental health care.

### Research Objectives

The study compares GPT-4 and Chat-GPT on their responses to 18 psychological problems using a blind-rating method, and evaluates their potential value in mental health care settings.

### Research Methods

- **Blind-rating Method**: A clinical psychologist rated the responses without knowing which model had produced them.
- **Problem Scope**: The problems cover multiple mental health topics, such as depression, anxiety, and trauma, to give a comprehensive evaluation.
- **Evaluation Criteria**: The psychologist scored each response for clinical relevance and empathy, with a maximum score of 10 points.

### Main Findings

- **Performance Differences**: GPT-4 averaged 8.29 and Chat-GPT averaged 6.52, a statistically significant difference (p < 0.05).
- **Advantages of GPT-4**: GPT-4 generates more clinically relevant and empathetic responses and is better able to support and guide potential users.
- **Disadvantages of Chat-GPT**: Chat-GPT's responses are more generic, lack depth, and sometimes deviate from evidence-based practice.
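The summary reports a significant difference in mean ratings (p < 0.05) but does not say which statistical test was used. As an illustration only, the sketch below shows one way paired ratings from a blind assessment could be compared: an exact sign-flip permutation test, chosen here because it needs nothing beyond the Python standard library. The function name and the ratings are hypothetical placeholders, not the study's data or its method.

```python
from itertools import product


def paired_permutation_test(scores_a, scores_b):
    """Two-sided exact sign-flip permutation test on paired scores.

    Returns (mean difference a - b, p-value). This is an assumed,
    illustrative analysis; the source does not name the test it used.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired scores must have equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    # Under the null hypothesis each paired difference is equally likely
    # to be positive or negative, so enumerate every sign assignment.
    for signs in product((1, -1), repeat=n):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        # Small tolerance guards against float round-off in the comparison.
        if flipped >= observed - 1e-12:
            extreme += 1
    return sum(diffs) / n, extreme / total


# Hypothetical placeholder ratings on a 10-point scale (NOT the study's data).
model_a = [9, 8, 9, 7, 8, 9, 8, 9]
model_b = [7, 6, 7, 6, 6, 7, 7, 6]
mean_diff, p_value = paired_permutation_test(model_a, model_b)
print(f"mean difference = {mean_diff:.3f}, p = {p_value:.4f}")
```

Exact enumeration visits 2^n sign assignments, which stays tractable for 18 paired ratings (262,144 assignments). A paired t-test or Wilcoxon signed-rank test would be a common alternative when a statistics library is available.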
### Conclusions

This study provides an important reference point for applying large language models in mental health care. Although GPT-4 performed well, the study emphasizes that continued research and development is needed to optimize how these models are used in clinical settings. Future research should examine why the models differ in performance and how well the findings generalize across populations and mental health conditions.