Abstract: The latest large language models (LLMs), such as ChatGPT, exhibit strong capabilities in automated mental health analysis. However, existing studies have several limitations, including inadequate evaluations, a lack of prompting strategies, and little exploration of LLMs for explainability. To bridge these gaps, we comprehensively evaluate the mental health analysis and emotional reasoning ability of LLMs on 11 datasets across 5 tasks. We explore the effects of different prompting strategies with unsupervised and distantly supervised emotional information. Based on these prompts, we explore LLMs for interpretable mental health analysis by instructing them to generate explanations for each of their decisions. We conduct strict human evaluations to assess the quality of the generated explanations, yielding a novel dataset with 163 human-assessed explanations. We benchmark existing automatic evaluation metrics on this dataset to guide future related work. According to the results, ChatGPT shows strong in-context learning ability but still lags significantly behind advanced task-specific methods. Careful prompt engineering with emotional cues and expert-written few-shot examples can also effectively improve performance on mental health analysis. In addition, ChatGPT generates explanations that approach human performance, showing its great potential in explainable mental health analysis.

What problem does this paper attempt to address?

This paper addresses the limitations of current research on large language models (LLMs) for mental health analysis: insufficient evaluation, a lack of effective prompting strategies, and neglect of model interpretability. Specifically:

1. **Performance and Interpretability**: Although the latest large language models such as ChatGPT have shown strong capabilities in automatic mental health analysis, existing research has several main problems:
   - **Insufficient Evaluation**: Most existing research has tested only a few binary-classification mental health condition detection tasks and lacks a comprehensive evaluation of more complex tasks (such as emotion reasoning and cause detection).
   - **Lack of Prompting Strategies**: Most research uses simple prompts to detect mental health conditions directly, ignoring useful information such as emotional cues.
   - **Lack of Interpretability**: Existing research rarely explores how to generate interpretable mental health analysis results with large language models, so the results lack transparency and credibility.
2. **Research Objectives**:
   - **RQ1**: How capable are large language models in general mental health analysis and emotion reasoning in zero-shot/few-shot settings?
   - **RQ2**: How do different prompting strategies and emotional cues affect ChatGPT's mental health analysis capabilities?
   - **RQ3**: Can ChatGPT generate reasonable explanations for its mental health analysis decisions?

To answer these research questions, the authors carried out the following work:

- **Preliminary Research**: Evaluate the performance of four large language models of different scales (ChatGPT, InstructGPT-3, LLaMA-13B, and LLaMA-7B) on mental health analysis and emotion reasoning tasks.
- **Prompting Strategies**: Systematically explore different prompting strategies, including zero-shot prompting, chain-of-thought (CoT) prompting, emotion-enhanced prompting, and few-shot emotion-enhanced prompting.
- **Interpretability Exploration**: Instruct two representative models (ChatGPT and InstructGPT-3) to generate natural-language explanations, and evaluate them manually under a strict annotation protocol, creating a new dataset of 163 human-evaluated explanations.
- **Automatic Evaluation**: Benchmark existing automatic evaluation metrics to guide future research on automatic evaluation of interpretable mental health analysis.

Through these studies, the authors aim to improve the performance and interpretability of large language models in mental health analysis and to provide guidance for future related research.
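
The four prompting strategies listed above differ mainly in what is placed around the post being analyzed. The following is a minimal Python sketch of that idea, not the prompts used in the paper: the function `build_prompt`, the strategy names, and all template wording (`QUESTION`, `COT_CUE`, `EMOTION_CUE`, `EXPLAIN_CUE`) are hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical template wording; the paper's actual prompts may differ.
QUESTION = "Does the poster show signs of depression? Answer yes or no."
COT_CUE = "Let's think step by step."
EMOTION_CUE = "Consider the emotions expressed in the post before answering."
EXPLAIN_CUE = "Then explain your reasoning."

STRATEGIES = {"zero_shot", "cot", "emotion", "few_shot_emotion"}


def build_prompt(
    post: str,
    strategy: str = "zero_shot",
    examples: Optional[List[Tuple[str, str]]] = None,
    explain: bool = False,
) -> str:
    """Assemble a prompt for one post under the chosen (hypothetical) strategy."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy}")

    parts: List[str] = []

    # Few-shot: prepend (post, answer) demonstrations, e.g. expert-written ones.
    if strategy == "few_shot_emotion":
        for ex_post, ex_answer in examples or []:
            parts.append(
                f"Post: {ex_post}\n{EMOTION_CUE}\n{QUESTION}\nAnswer: {ex_answer}"
            )

    query = [f"Post: {post}"]
    # Emotion-enhanced variants add an emotional cue before the question.
    if strategy in {"emotion", "few_shot_emotion"}:
        query.append(EMOTION_CUE)
    query.append(QUESTION)
    # Optionally request an explanation alongside the decision.
    if explain:
        query.append(EXPLAIN_CUE)
    # Chain-of-thought appends a reasoning trigger after the question.
    if strategy == "cot":
        query.append(COT_CUE)
    query.append("Answer:")

    parts.append("\n".join(query))
    return "\n\n".join(parts)
```

For example, `build_prompt("I feel empty.", "few_shot_emotion", examples=[("I can't sleep.", "yes")])` would return one demonstration followed by the query for the new post. Keeping the cues as separate constants makes it easy to compare strategies on the same post while changing only one element at a time.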

Towards Interpretable Mental Health Analysis with Large Language Models

MentaLLaMA: Interpretable Mental Health Analysis on Social Media with Large Language Models

MentalGLM Series: Explainable Large Language Models for Mental Health Analysis on Chinese Social Media

Rethinking Large Language Models in Mental Health Applications

Mental-LLM: Leveraging Large Language Models for Mental Health Prediction via Online Text Data

A Comprehensive Evaluation of Large Language Models on Mental Illnesses

An Assessment on Comprehending Mental Health through Large Language Models

Unveiling and Mitigating Bias in Mental Health Analysis with Large Language Models

Large Language Models Perform on Par with Experts Identifying Mental Health Factors in Adolescent Online Forums

Toward explainable AI (XAI) for mental health detection based on language behavior

Can AI Relate: Testing Large Language Model Response for Mental Health Support

Large Language Model for Mental Health: A Systematic Review

PsyEval: A Comprehensive Large Language Model Evaluation Benchmark for Mental Health

From Classification to Clinical Insights: Towards Analyzing and Reasoning About Mobile and Behavioral Health Data With Large Language Models

A Dual-Prompting for Interpretable Mental Health Language Models

Applying and Evaluating Large Language Models in Mental Health Care: A Scoping Review of Human-Assessed Generative Tasks

Applications of large language models in psychiatry: a systematic review

WellDunn: On the Robustness and Explainability of Language Models and Large Language Models in Identifying Wellness Dimensions

A Novel Nuanced Conversation Evaluation Framework for Large Language Models in Mental Health

Large Language Models in Mental Health Care: a Scoping Review

Emotional intelligence of Large Language Models