Abstract: Large language models have exhibited exceptional performance on various Natural Language Processing (NLP) tasks, leveraging techniques such as pre-training and instruction fine-tuning. Despite these advances, their effectiveness in medical applications is limited by challenges such as factual inaccuracies, limited reasoning abilities, and a lack of grounding in real-world experience. In this study, we present ClinicalGPT, a language model explicitly designed and optimized for clinical scenarios. By incorporating extensive and diverse real-world data, such as medical records, domain-specific knowledge, and multi-round dialogue consultations, into the training process, ClinicalGPT is better prepared to handle multiple clinical tasks. Furthermore, we introduce a comprehensive evaluation framework that includes medical knowledge question-answering, medical exams, patient consultations, and diagnostic analysis of medical records. Our results demonstrate that ClinicalGPT significantly outperforms other models in these tasks, highlighting the effectiveness of our approach in adapting large language models to the critical domain of healthcare.

What problem does this paper attempt to address?

This paper addresses the challenges of applying large language models (LLMs) in medical settings. Although existing LLMs perform well on natural language processing tasks, their use in medicine is restricted by three problems:

1. **Factual accuracy**: LLMs can produce factual errors, which are unacceptable in medical scenarios.
2. **Reasoning ability**: The models perform poorly on complex medical reasoning tasks.
3. **Lack of real-world experience**: Because they are not trained on real-world medical data, the models tend to give generic answers that lack specific medical insight.

To overcome these challenges, the authors propose **ClinicalGPT**, a large language model specifically designed and optimized for clinical scenarios. By training on a large amount of real-world medical data (such as electronic medical records, domain-specific knowledge, and multi-round dialogue consultations), ClinicalGPT can better handle a range of clinical tasks. The authors also establish a comprehensive evaluation framework, covering medical knowledge Q&A, medical examinations, patient consultations, and medical record analysis, to verify ClinicalGPT's performance.

The main contributions of the paper are:

- **Dataset construction**: Diverse medical datasets, including cMedQA2, cMedQA-KG, MD-EHR, MEDQA-MCMLE, and MedDialog, are used to train and evaluate the model.
- **Fine-tuning method**: Supervised fine-tuning (SFT) is combined with reinforcement learning (RL) to improve model performance.
- **Evaluation framework**: A set of evaluation tasks and metrics spanning multiple medical application scenarios is designed to assess the model's practicality and reliability.

Through these methods, ClinicalGPT significantly outperforms other existing large language models on multiple medical tasks, demonstrating its potential in the medical field.
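The summary states only that ClinicalGPT combines SFT with RL. In RLHF-style pipelines, the RL stage is commonly driven by a reward model trained on pairs of responses in which one is preferred over the other. The following is a minimal sketch of the pairwise ranking loss such a reward model typically minimizes, assuming a Bradley-Terry formulation, -log(sigmoid(r_chosen - r_rejected)). It is not taken from the ClinicalGPT paper, and the function names are hypothetical.

```python
import math


def pairwise_reward_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Hypothetical helper: Bradley-Terry ranking loss for one preference pair.

    Computes -log(sigmoid(reward_chosen - reward_rejected)), which is small when
    the reward model scores the preferred response well above the rejected one.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch on the sign of m so that
    # exp() is only ever called on a non-positive argument (no overflow).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_ranking_loss(pairs: list[tuple[float, float]]) -> float:
    """Hypothetical helper: average loss over (chosen, rejected) reward pairs."""
    if not pairs:
        raise ValueError("need at least one preference pair")
    return sum(pairwise_reward_loss(c, r) for c, r in pairs) / len(pairs)


if __name__ == "__main__":
    # Illustrative reward scores only; not data from the paper.
    pairs = [(2.0, -1.0), (0.5, 0.7)]
    print(round(mean_ranking_loss(pairs), 4))
```

Branching on the sign of the margin keeps the loss finite even for extreme reward gaps, which a naive `-math.log(1 / (1 + math.exp(-m)))` would not.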

ClinicalGPT: Large Language Models Finetuned with Diverse Medical Data and Comprehensive Evaluation
