Abstract:

Background: Comprehensive session summaries enable effective continuity in mental health counseling and facilitate informed therapy planning. However, manual summarization is a significant burden that diverts experts' attention from the core counseling process. Automatic summarization can address this issue by giving mental health professionals concise summaries of lengthy therapy sessions, thereby increasing their efficiency. Existing approaches, however, often overlook the nuances of counseling interactions.

Objective: This study benchmarks state-of-the-art large language models (LLMs) on aspect-based summarization, evaluating how effectively they can selectively summarize the various components of therapy sessions.

Methods: We first created Mental Health Counseling-Component-Guided Dialogue Summaries, a benchmarking data set that consists of 191 counseling sessions with summaries focused on 3 distinct counseling components (also known as counseling aspects). Next, we assessed the capabilities of 11 state-of-the-art LLMs on the task of counseling-component-guided summarization. The generated summaries were evaluated quantitatively using standard summarization metrics and verified qualitatively by mental health professionals.

Results: On standard quantitative metrics, namely Recall-Oriented Understudy for Gisting Evaluation (ROUGE)-1, ROUGE-2, ROUGE-L, and Bidirectional Encoder Representations from Transformers Score (BERTScore), task-specific LLMs such as MentalLlama, Mistral, and MentalBART showed superior performance across all aspects of the counseling components. Furthermore, expert evaluation revealed that Mistral outperformed both MentalLlama and MentalBART across 6 parameters: affective attitude, burden, ethicality, coherence, opportunity costs, and perceived effectiveness. However, all three models shared a weakness: each left room for improvement on the opportunity costs and perceived effectiveness metrics.

Conclusions: While LLMs fine-tuned specifically on mental health domain data achieve better automatic evaluation scores, expert assessments indicate that these models are not yet reliable for clinical application. Further refinement and validation are necessary before they are implemented in practice.

Adapted Large Language Models Can Outperform Medical Experts in Clinical Text Summarization

Clinical Text Summarization: Adapting Large Language Models Can Outperform Human Experts

Adapting Large Language Models for Automated Summarisation of Electronic Medical Records in Clinical Coding

Evaluation of Large Language Models for Summarization Tasks in the Medical Domain: A Narrative Review

A Dataset and Benchmark for Hospital Course Summarization with Adapted Large Language Models

Improving Clinical Expertise in Large Language Models Using Electronic Medical Records

A Comparative Study of Recent Large Language Models on Generating Hospital Discharge Summaries for Lung Cancer Patients

Using large language models for safety-related table summarization in clinical study reports

Large language models encode clinical knowledge

Comparative Analysis of Open-Source Language Models in Summarizing Medical Text Data

Can Large Language Models Replace Data Scientists in Clinical Research?

Enhanced Electronic Health Records Text Summarization Using Large Language Models

A Framework to Assess Clinical Safety and Hallucination Rates of LLMs for Medical Text Summarisation

Harmonising the Clinical Melody: Tuning Large Language Models for Hospital Course Summarisation in Clinical Coding

Large Language Models Illuminate a Progressive Pathway to Artificial Healthcare Assistant: A Review

Evaluating large language models in medical applications: a survey

Exploring the Efficacy of Large Language Models in Summarizing Mental Health Counseling Sessions: Benchmark Study

Large language models in medical and healthcare fields: applications, advances, and challenges

The Limited Impact of Medical Adaptation of Large Language and Vision-Language Models

Medical Adaptation of Large Language and Vision-Language Models: Are We Making Progress?