Abstract:

Background: Recent advancements in large language models (LLMs) have accelerated their use across various domains. Psychiatric interviews, which are goal-oriented and structured, remain a largely underexplored area where LLMs can provide substantial value. In this study, we explore the application of LLMs to enhance psychiatric interviews by analyzing counseling data from North Korean defectors who have experienced traumatic events and mental health issues.

Objective: This study aims to investigate whether LLMs can (1) delineate parts of the conversation that suggest psychiatric symptoms and identify those symptoms, and (2) summarize stressors and symptoms based on the interview dialogue transcript.

Methods: Given the interview transcripts, we align the LLMs to perform 3 tasks: (1) extracting stressors from the transcripts, (2) delineating symptoms and their indicative sections, and (3) summarizing the patients based on the extracted stressors and symptoms. These 3 tasks address the 2 objectives: symptom delineation is based on the output of the second task, and the interview summary incorporates the outputs of all 3 tasks. The transcript data were labeled by mental health experts for training and evaluating the LLMs.

Results: First, we present the performance of LLMs in estimating (1) the transcript sections related to psychiatric symptoms and (2) the names of the corresponding symptoms. In the zero-shot inference setting using the GPT-4 Turbo model, 73 out of 102 transcript segments demonstrated a recall mid-token distance d<20 for estimating the sections associated with the symptoms. For naming the corresponding symptoms, the fine-tuning method demonstrated a performance advantage over the zero-shot inference setting of the GPT-4 Turbo model. On average, the fine-tuning method achieved an accuracy of 0.82, a precision of 0.83, a recall of 0.82, and an F1-score of 0.82. Second, the transcripts were used to generate summaries for each interviewee using LLMs. This generative task was evaluated using metrics such as Generative Evaluation (G-Eval) and Bidirectional Encoder Representations from Transformers Score (BERTScore). The summaries generated by the GPT-4 Turbo model, utilizing both symptom and stressor information, achieved high average G-Eval scores: coherence of 4.66, consistency of 4.73, fluency of 2.16, and relevance of 4.67. The use of retrieval-augmented generation did not lead to a significant improvement in performance.

Conclusions: LLMs, using either (1) appropriate prompting techniques or (2) fine-tuning methods with data labeled by mental health experts, achieved an accuracy of over 0.8 for the symptom delineation task when measured across all segments in the transcript. Additionally, they attained a G-Eval score of over 4.6 for coherence in the summarization task. This research contributes to the emerging field of applying LLMs in psychiatric interviews and demonstrates their potential effectiveness in assisting mental health practitioners.
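As an illustration of how the symptom-naming metrics reported above (accuracy, precision, recall, and F1-score) can be computed from expert labels and model predictions, the following is a minimal sketch. It assumes one symptom label per transcript segment and macro-averaging across labels; the abstract does not state the averaging scheme, so this is an assumption rather than the study's procedure. The function name `macro_metrics` and the example labels are hypothetical.

```python
from collections import Counter


def macro_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1.

    Assumes single-label classification: one expert label and one
    predicted label per transcript segment (an assumption, not the
    study's documented setup).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one labeled segment is required")

    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label was wrong
            fn[truth] += 1  # true label was missed

    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Hypothetical expert labels and model predictions for four segments.
expert = ["insomnia", "anxiety", "depressed mood", "insomnia"]
model = ["insomnia", "anxiety", "anxiety", "insomnia"]
scores = macro_metrics(expert, model)
print({name: round(value, 2) for name, value in scores.items()})
```

Macro-averaging weights every symptom equally regardless of how often it occurs, which can matter when some symptoms are rare in the labeled transcripts; a frequency-weighted average would be an equally plausible choice.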

A Novel Nuanced Conversation Evaluation Framework for Large Language Models in Mental Health

Can AI Relate: Testing Large Language Model Response for Mental Health Support

Mental-LLM: Leveraging Large Language Models for Mental Health Prediction via Online Text Data

A Comprehensive Evaluation of Large Language Models on Mental Illnesses

A Computational Framework for Behavioral Assessment of LLM Therapists

The Role of Humanization and Robustness of Large Language Models in Conversational Artificial Intelligence for Individuals With Depression: A Critical Analysis

Testing the Limits of Language Models: A Conversational Framework for Medical AI Assessment

PALLM: Evaluating and Enhancing PALLiative Care Conversations with Large Language Models

Towards Interpretable Mental Health Analysis with Large Language Models

Harnessing Large Language Models' Empathetic Response Generation Capabilities for Online Mental Health Counselling Support

Aligning Large Language Models for Enhancing Psychiatric Interviews Through Symptom Delineation and Summarization: Pilot Study

Large Language Models and Healthcare Alliance: Potential and Challenges of Two Representative Use Cases

Aligning Large Language Models for Enhancing Psychiatric Interviews through Symptom Delineation and Summarization

An Assessment on Comprehending Mental Health through Large Language Models

Supporting the Demand on Mental Health Services with AI-Based Conversational Large Language Models (LLMs)

Guidelines For Rigorous Evaluation of Clinical LLMs For Conversational Reasoning

Leveraging Large Language Models for Patient Engagement: The Power of Conversational AI in Digital Health

An Automatic Evaluation Framework for Multi-turn Medical Consultations Capabilities of Large Language Models

Using large language models for extracting and pre-annotating texts on mental health from noisy data in a low-resource language

LLM-Mini-CEX: Automatic Evaluation of Large Language Model for Diagnostic Conversation

Strategic Prompting for Conversational Tasks: A Comparative Analysis of Large Language Models Across Diverse Conversational Tasks