Large Language Models Perform on Par with Experts Identifying Mental Health Factors in Adolescent Online Forums

Isabelle Lorge, Dan W. Joyce, Andrey Kormilitzin
2024-04-26
Abstract: Mental health in children and adolescents has been steadily deteriorating over the past few years. The recent advent of Large Language Models (LLMs) offers much hope for cost- and time-efficient scaling of monitoring and intervention. Yet despite issues that are especially prevalent in this population, such as school bullying and eating disorders, previous studies have not investigated LLM performance in this domain or for open information extraction, where the set of answers is not predetermined. We create a new dataset of Reddit posts from adolescents aged 12-19, annotated by expert psychiatrists for the following categories: TRAUMA, PRECARITY, CONDITION, SYMPTOMS, SUICIDALITY and TREATMENT, and compare expert labels to annotations from two top-performing LLMs (GPT-3.5 and GPT-4). In addition, we create two synthetic datasets to assess whether LLMs perform better when annotating data as they generate it. We find GPT-4 to be on par with human inter-annotator agreement, and performance on synthetic data to be substantially higher. However, we find that the model still occasionally errs on issues of negation and factuality, and that higher performance on synthetic data is driven by the greater complexity of real data rather than by an inherent advantage.
Computation and Language
What problem does this paper attempt to address?
The paper addresses how effectively and accurately large language models (LLMs) can identify mental health factors in adolescents' social media posts, with a view to supporting monitoring and intervention. Specifically, the research objectives are:

1. **Create and annotate a high-quality dataset**: Build a new dataset of Reddit posts from adolescents (aged 12-19), annotated by expert psychiatrists for six categories: trauma, precarity, condition, symptoms, suicidality, and treatment.
2. **Evaluate the performance of LLMs**: Compare two top-performing LLMs (GPT-3.5 and GPT-4) against the expert annotations when extracting mental health factors from these posts, to verify whether the models reach a level comparable to expert annotators.
3. **Explore the utility of synthetic data**: Generate two synthetic datasets to assess whether LLMs perform better when annotating text as they generate it, and explore the potential use of such data for training task-specific models.

Through these objectives, the research aims to provide an efficient and cost-effective method for monitoring and intervention in adolescent mental health, while also offering new insights into the application of synthetic data in the healthcare domain.
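
To make the comparison between expert and LLM labels concrete, the following is a minimal sketch of one common way to quantify agreement: Cohen's kappa computed per category over binary "category present / absent" labels. This is an illustrative assumption, not the paper's confirmed evaluation procedure; the abstract does not state which agreement metric is used, and the function names and input layout here are hypothetical.

```python
from typing import Dict, List

# Category names taken from the abstract.
CATEGORIES = ["TRAUMA", "PRECARITY", "CONDITION",
              "SYMPTOMS", "SUICIDALITY", "TREATMENT"]


def cohen_kappa(a: List[int], b: List[int]) -> float:
    """Cohen's kappa for two annotators' binary (0/1) labels on the same posts."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of posts where both annotators agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's marginal rate of positive labels.
    p_a = sum(a) / n
    p_b = sum(b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


def per_category_kappa(expert: Dict[str, List[int]],
                       model: Dict[str, List[int]]) -> Dict[str, float]:
    """Kappa between expert and model labels for every category present in both."""
    return {c: cohen_kappa(expert[c], model[c])
            for c in CATEGORIES if c in expert and c in model}
```

Under this sketch, `expert` and `model` would each map a category name to one 0/1 label per post, and the resulting model-versus-expert scores could be set beside the expert-versus-expert scores to judge whether the model is "on par" with human agreement. Open-ended extraction, where answers are free text rather than fixed labels, would need an additional matching step that is not shown here.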