Adapting Open-Source Large Language Models for Cost-Effective, Expert-Level Clinical Note Generation with On-Policy Reinforcement Learning

Hanyin Wang, Chufan Gao, Bolun Liu, Qiping Xu, Guleid Hussein, Mohamad El Labban, Kingsley Iheasirim, Hariprasad Korsapati, Chuck Outcalt, Jimeng Sun
2024-06-10
Abstract: Proprietary Large Language Models (LLMs) such as GPT-4 and Gemini have demonstrated promising capabilities in clinical text summarization tasks. However, due to patient data privacy concerns and computational costs, many healthcare providers prefer using small, locally-hosted models over external generic LLMs. This study presents a comprehensive domain- and task-specific adaptation process for the open-source LLaMA-2 13 billion parameter model, enabling it to generate high-quality clinical notes from outpatient patient-doctor dialogues. Our process incorporates continued pre-training, supervised fine-tuning, and reinforcement learning from both AI and human feedback. We introduced a new approach, DistillDirect, for performing on-policy reinforcement learning with Gemini 1.0 Pro as the teacher model. Our resulting model, LLaMA-Clinic, can generate clinical notes comparable in quality to those authored by physicians. In a blinded physician reader study, the majority (90.4%) of individual evaluations rated the notes generated by LLaMA-Clinic as "acceptable" or higher across all three criteria: real-world readiness, completeness, and accuracy. In the more challenging "Assessment and Plan" section, LLaMA-Clinic scored higher (4.2/5) in real-world readiness than physician-authored notes (4.1/5). Our cost analysis for inference shows that our LLaMA-Clinic model achieves a 3.75-fold cost reduction compared to an external generic LLM service. Additionally, we highlight key considerations for future clinical note-generation tasks, emphasizing the importance of pre-defining a best-practice note format, rather than relying on LLMs to determine this for clinical practice. We have made our newly created synthetic clinic dialogue-note dataset and the physician feedback dataset publicly available to foster future research.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper addresses how to adapt open-source large language models (LLMs) for cost-effective, expert-level clinical note generation using on-policy reinforcement learning. Although proprietary LLMs such as GPT-4 and Gemini perform well on clinical text summarization tasks, many healthcare providers prefer small, locally hosted models because of patient data privacy concerns and computational costs. The paper proposes a comprehensive domain- and task-specific adaptation process for the open-source LLaMA-2 13-billion-parameter model that combines continued pre-training, supervised fine-tuning, and reinforcement learning from both AI and human feedback. It also introduces a new method, DistillDirect, for performing on-policy reinforcement learning within a direct preference optimization process, with Gemini 1.0 Pro as the teacher model.

The main contributions are:
1. LLaMA-Clinic, a model that generates clinical notes comparable in quality to those written by physicians. In the more challenging "Assessment and Plan" section, it scored higher in real-world readiness (4.2/5) than physician-authored notes (4.1/5).
2. DistillDirect, a new approach for on-policy reinforcement learning during distillation from a teacher model.
3. An inference cost analysis showing that LLaMA-Clinic achieves a 3.75-fold cost reduction compared to an external generic LLM service.
4. Guidance for future clinical note-generation tasks, emphasizing the importance of pre-defining a best-practice note format rather than relying on LLMs to determine it for clinical practice.
5. A publicly available synthetic clinic dialogue-note dataset and a physician feedback dataset to support future research.

Through a series of experiments, the paper shows how the LLaMA-2 model can be improved for clinical note generation by applying continued pre-training, supervised fine-tuning, and reinforcement learning in sequence.
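As a rough illustration of the direct preference optimization step mentioned above, the following Python sketch shows the standard DPO loss for a single preference pair, together with one way an on-policy pair could be assembled. The pairing rule (teacher note as preferred, the current policy's own sample as rejected) is an assumption made for this sketch and is not spelled out in the text here. The names `dpo_loss`, `build_on_policy_pair`, `teacher_generate`, and `policy_generate`, and the value `beta=0.1`, are hypothetical.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), written to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def build_on_policy_pair(dialogue, teacher_generate, policy_generate):
    """Assemble one preference pair for a dialogue.

    Assumption for this sketch: the teacher's note is treated as "chosen"
    and a fresh sample from the current policy as "rejected". Sampling from
    the current policy at each round is what makes the data on-policy.
    """
    return {
        "prompt": dialogue,
        "chosen": teacher_generate(dialogue),
        "rejected": policy_generate(dialogue),
    }
```

In a real training loop, `teacher_generate` would wrap a call to the teacher model and `policy_generate` would sample from the latest checkpoint, so the rejected responses are refreshed as the policy changes. The loss falls below log 2 once the policy prefers the chosen response more strongly than the reference model does.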
Ultimately, in a blinded physician reader study, the majority (90.4%) of individual evaluations rated the notes generated by LLaMA-Clinic as "acceptable" or higher across all three criteria: real-world readiness, completeness, and accuracy. The paper also discusses the cost and data-security advantages of open-source models over proprietary ones.
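The reader-study figure in the abstract counts an individual evaluation only if it reaches "acceptable" or higher on every one of the three criteria. A minimal Python sketch of that kind of tally follows. It assumes a 5-point scale on which 3 corresponds to "acceptable"; that threshold, the field names, and the example ratings are all hypothetical and are not data from the study.

```python
CRITERIA = ("real_world_readiness", "completeness", "accuracy")
ACCEPTABLE = 3  # assumption: 3 on a 5-point scale means "acceptable"


def fraction_acceptable(evaluations, threshold=ACCEPTABLE):
    """Share of evaluations scoring >= threshold on every criterion."""
    if not evaluations:
        raise ValueError("no evaluations given")
    passed = sum(
        all(e[c] >= threshold for c in CRITERIA) for e in evaluations
    )
    return passed / len(evaluations)


def mean_score(evaluations, criterion):
    """Average score on a single criterion."""
    return sum(e[criterion] for e in evaluations) / len(evaluations)


# Made-up ratings for illustration only; not data from the study.
example = [
    {"real_world_readiness": 4, "completeness": 5, "accuracy": 4},
    {"real_world_readiness": 3, "completeness": 3, "accuracy": 4},
    {"real_world_readiness": 2, "completeness": 4, "accuracy": 5},
    {"real_world_readiness": 5, "completeness": 4, "accuracy": 4},
]
print(fraction_acceptable(example))
print(mean_score(example, "real_world_readiness"))
```

Requiring all three criteria to pass is stricter than averaging them: the third example rating above fails on real-world readiness alone, even though its other two scores are high.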