Adapting Open-Source Large Language Models for Cost-Effective, Expert-Level Clinical Note Generation with On-Policy Reinforcement Learning

Hanyin Wang, Chufan Gao, Bolun Liu, Qiping Xu, Guleid Hussein, Mohamad El Labban, Kingsley Iheasirim, Hariprasad Korsapati, Chuck Outcalt, Jimeng Sun

2024-06-10

Abstract: Proprietary Large Language Models (LLMs) such as GPT-4 and Gemini have demonstrated promising capabilities in clinical text summarization tasks. However, due to patient data privacy concerns and computational costs, many healthcare providers prefer using small, locally-hosted models over external generic LLMs. This study presents a comprehensive domain- and task-specific adaptation process for the open-source LLaMA-2 13 billion parameter model, enabling it to generate high-quality clinical notes from outpatient patient-doctor dialogues. Our process incorporates continued pre-training, supervised fine-tuning, and reinforcement learning from both AI and human feedback. We introduced a new approach, DistillDirect, for performing on-policy reinforcement learning with Gemini 1.0 Pro as the teacher model. Our resulting model, LLaMA-Clinic, can generate clinical notes comparable in quality to those authored by physicians. In a blinded physician reader study, the majority (90.4%) of individual evaluations rated the notes generated by LLaMA-Clinic as "acceptable" or higher across all three criteria: real-world readiness, completeness, and accuracy. In the more challenging "Assessment and Plan" section, LLaMA-Clinic scored higher (4.2/5) in real-world readiness than physician-authored notes (4.1/5). Our cost analysis for inference shows that our LLaMA-Clinic model achieves a 3.75-fold cost reduction compared to an external generic LLM service. Additionally, we highlight key considerations for future clinical note-generation tasks, emphasizing the importance of pre-defining a best-practice note format, rather than relying on LLMs to determine this for clinical practice. We have made our newly created synthetic clinic dialogue-note dataset and the physician feedback dataset publicly available to foster future research.

Computation and Language, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

This paper addresses how to adapt open-source large language models (LLMs) for cost-effective, expert-level clinical note generation using on-policy reinforcement learning. The study notes that although proprietary LLMs such as GPT-4 and Gemini perform well on clinical text summarization, many healthcare providers prefer small, locally hosted models because of patient data privacy concerns and computational costs. The paper proposes a comprehensive domain- and task-specific adaptation process for the open-source LLaMA-2 13-billion-parameter model that combines continued pre-training, supervised fine-tuning, and reinforcement learning from both AI and human feedback. It also introduces a new method, "DistillDirect", for performing on-policy reinforcement learning through direct preference optimization, with Gemini 1.0 Pro as the teacher model.

The main contributions of the research are:

1. A model, LLaMA-Clinic, that generates clinical notes comparable in quality to those written by physicians. In the more challenging "Assessment and Plan" section, it scored higher in real-world readiness (4.2/5) than physician-authored notes (4.1/5).
2. A new approach, DistillDirect, for on-policy reinforcement learning during distillation from a teacher model.
3. An inference cost analysis showing that LLaMA-Clinic achieves a 3.75-fold cost reduction compared with an external generic LLM service.
4. An emphasis on pre-defining a best-practice note format rather than relying on LLMs to determine the appropriate format for clinical practice.
5. Public release of a synthetic clinic dialogue-note dataset and a physician feedback dataset to foster future research.

The research shows how LLaMA-2 can be improved for clinical note generation through a sequence of continued pre-training, supervised fine-tuning, and reinforcement learning. In a blinded physician reader study, the majority (90.4%) of individual evaluations rated the notes generated by LLaMA-Clinic as "acceptable" or higher across all three criteria: real-world readiness, completeness, and accuracy. The paper also discusses the cost and data-security advantages of locally hosted open-source models over proprietary models.
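
The summary above mentions that DistillDirect performs reinforcement learning through direct preference optimization (DPO) with a teacher model. The following is a minimal sketch of the standard DPO loss for a single preference pair, not the authors' implementation. The function names, the beta value, and the pairing rule (teacher note as "chosen", the student's own sampled note as "rejected") are assumptions made for illustration; the text above does not specify them.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is a summed sequence log-probability (a float).
    beta=0.1 is an illustrative default, not a value taken from the paper.
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)), written in a numerically stable form
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def build_preference_pair(dialogue, teacher_note, student_note):
    """Hypothetical pairing rule (an assumption): the teacher model's note is
    preferred over a note sampled from the current student policy."""
    return {"prompt": dialogue, "chosen": teacher_note, "rejected": student_note}


if __name__ == "__main__":
    pair = build_preference_pair("<dialogue>", "<teacher note>", "<student note>")
    print(sorted(pair))
    # Policy and reference agree exactly, so the margin is zero.
    print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
```

Under this assumed pairing, sampling the rejected note from the current student at each round is what would make the procedure on-policy: the loss falls below log 2 only once the student assigns relatively more probability to the teacher's note than the reference model does.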
