Mental-LLM: Leveraging Large Language Models for Mental Health Prediction via Online Text Data

Xuhai Xu, Bingsheng Yao, Yuanzhe Dong, Saadia Gabriel, Hong Yu, James Hendler, Marzyeh Ghassemi, Anind K. Dey, Dakuo Wang
DOI: https://doi.org/10.1145/3643540
2024-01-29
Abstract: Advances in large language models (LLMs) have empowered a variety of applications. However, there is still a significant gap in research when it comes to understanding and enhancing the capabilities of LLMs in the field of mental health. In this work, we present a comprehensive evaluation of multiple LLMs on various mental health prediction tasks via online text data, including Alpaca, Alpaca-LoRA, FLAN-T5, GPT-3.5, and GPT-4. We conduct a broad range of experiments, covering zero-shot prompting, few-shot prompting, and instruction fine-tuning. The results indicate a promising yet limited performance of LLMs with zero-shot and few-shot prompt designs for mental health tasks. More importantly, our experiments show that instruction finetuning can significantly boost the performance of LLMs for all tasks simultaneously. Our best-finetuned models, Mental-Alpaca and Mental-FLAN-T5, outperform the best prompt design of GPT-3.5 (25 and 15 times bigger) by 10.9% on balanced accuracy and the best of GPT-4 (250 and 150 times bigger) by 4.8%. They further perform on par with the state-of-the-art task-specific language model. We also conduct an exploratory case study on LLMs' capability on mental health reasoning tasks, illustrating the promising capability of certain models such as GPT-4. We summarize our findings into a set of action guidelines for potential methods to enhance LLMs' capability for mental health tasks. Meanwhile, we also emphasize the important limitations before achieving deployability in real-world mental health settings, such as known racial and gender bias. We highlight the important ethical risks accompanying this line of research.
Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

The paper addresses the gap in understanding and improving how large language models (LLMs) perform on mental health prediction. Specifically, the researchers evaluate multiple LLMs on various mental health prediction tasks using online text data and examine how different techniques (zero-shot prompting, few-shot prompting, and instruction fine-tuning) affect the performance of these models.

### Background and Motivation

In recent years, large language models such as GPT-4, PaLM, and FLAN-T5 have demonstrated strong capabilities across a range of tasks, especially in zero-shot settings. In mental health, however, despite extensive research in natural language processing (NLP) and computational social science, most studies still build domain-specific machine learning models that must be fine-tuned for each task. Existing general-purpose LLMs have not been trained specifically for mental health tasks, so their performance in this field is limited.

### Research Objectives

1. **Evaluate the performance of existing LLMs on mental health tasks**: The researchers selected multiple LLMs (Alpaca, Alpaca-LoRA, FLAN-T5, GPT-3.5, and GPT-4) and evaluated them on mental health prediction tasks using zero-shot prompting, few-shot prompting, and instruction fine-tuning.
2. **Explore methods to enhance LLM performance**: Through experiments, the researchers compared the effectiveness of the different techniques and found that instruction fine-tuning in particular yields a significant improvement.
3. **Release open-source models**: The researchers released the fine-tuned Mental-Alpaca and Mental-FLAN-T5 models for use by other researchers and developers.

### Main Contributions

1. **Comprehensive Evaluation**: Provided performance evaluations of various LLMs on mental health tasks under zero-shot prompting, few-shot prompting, and instruction fine-tuning.
2. **Performance Enhancement**: Significantly improved LLM performance on multiple mental health tasks through instruction fine-tuning, as measured by balanced accuracy.
3. **Open-Source Models**: Released the Mental-Alpaca and Mental-FLAN-T5 models, which perform on par with the state-of-the-art task-specific language model.
4. **Technical Guidelines**: Summarized the findings into a set of action guidelines for enhancing LLMs' capability on mental health tasks, while emphasizing ethical considerations.

### Methods

1. **Zero-Shot Prompting**: Designed various prompt templates to evaluate LLM performance on mental health tasks without any labeled examples.
2. **Few-Shot Prompting**: Added a small number of labeled examples to the prompt to give the model additional context.
3. **Instruction Fine-Tuning**: Fine-tuned LLMs on multiple mental health datasets so that a single model can handle all tasks simultaneously.

### Results

1. **Zero-Shot Prompting**: Showed some promise but had limited performance on most tasks.
2. **Few-Shot Prompting**: Providing a small number of examples improved performance moderately, but the effect was limited.
3. **Instruction Fine-Tuning**: Significantly improved performance on all tasks simultaneously. The best fine-tuned models, Mental-Alpaca and Mental-FLAN-T5, outperform the best prompt design of GPT-3.5 by 10.9% on balanced accuracy and the best of GPT-4 by 4.8%, despite being far smaller (GPT-3.5 is 25 and 15 times bigger, GPT-4 is 250 and 150 times bigger).

### Ethical Considerations

The researchers emphasized the ethical risks that must be considered before deploying LLMs in real-world mental health settings, particularly known racial and gender bias. These limitations need to be addressed in further research and before practical application.
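
To make the zero-shot and few-shot setups and the balanced-accuracy metric mentioned above more concrete, here is a minimal Python sketch. It is not the authors' code: the prompt template wording, the function names, the example posts, and the labels are all hypothetical, chosen only for illustration. The metric follows the standard definition of balanced accuracy as the mean of per-class recall.

```python
from collections import defaultdict

# Hypothetical template; the paper's actual prompt wording is not reproduced here.
TEMPLATE = (
    "Post: {post}\n"
    "Question: Does the poster show signs of {condition}? Answer yes or no.\n"
    "Answer:"
)


def zero_shot_prompt(post, condition):
    """Build a prompt that contains only the target post and the question."""
    return TEMPLATE.format(post=post, condition=condition)


def few_shot_prompt(post, condition, examples):
    """Prepend a few labeled (post, answer) examples before the target post."""
    shots = [
        TEMPLATE.format(post=ex_post, condition=condition) + " " + ex_answer
        for ex_post, ex_answer in examples
    ]
    return "\n\n".join(shots + [zero_shot_prompt(post, condition)])


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally regardless of size."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


if __name__ == "__main__":
    # Made-up example post and label, for illustration only.
    examples = [("I slept well and enjoyed my day.", "no")]
    print(few_shot_prompt("I have felt tired all week.", "stress", examples))

    # Made-up labels: 3 of 4 "no" and 1 of 2 "yes" are predicted correctly.
    y_true = ["no", "no", "no", "no", "yes", "yes"]
    y_pred = ["no", "no", "no", "yes", "yes", "no"]
    print(balanced_accuracy(y_true, y_pred))
```

On the made-up labels, plain accuracy would be 4/6, while balanced accuracy averages the two per-class recalls (0.75 and 0.5). This is why the metric is a common choice when classes are imbalanced, as is often the case in mental health datasets.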