Abstract: Advances in large language models (LLMs) provide new opportunities in healthcare for improved patient care, clinical decision-making, and enhancement of physician and administrator workflows. However, the potential of these models importantly depends on their ability to generalize effectively across clinical environments and populations, a challenge often underestimated in early development. To better understand reasons for these challenges and inform mitigation approaches, we evaluated ClinicLLM, an LLM trained on [HOSPITAL]'s clinical notes, analyzing its performance on 30-day all-cause readmission prediction focusing on variability across hospitals and patient characteristics. We found poorer generalization particularly in hospitals with fewer samples, among patients with government and unspecified insurance, the elderly, and those with high comorbidities. To understand reasons for lack of generalization, we investigated sample sizes for fine-tuning, note content (number of words per note), patient characteristics (comorbidity level, age, insurance type, borough), and health system aspects (hospital, all-cause 30-day readmission, and mortality rates). We used descriptive statistics and supervised classification to identify features. We found that, along with sample size, patient age, number of comorbidities, and the number of words in notes are all important factors related to generalization. Finally, we compared local fine-tuning (hospital specific), instance-based augmented fine-tuning, and cluster-based fine-tuning for improving generalization. Among these, local fine-tuning proved most effective, increasing AUC by 0.25% to 11.74% (most helpful in settings with limited data). Overall, this study provides new insights for enhancing the deployment of large language models in the societally important domain of healthcare, and improving their performance for broader populations.

What problem does this paper attempt to address?

The main problem this paper addresses is evaluating and improving the generalization of large language models (LLMs) in healthcare, particularly their performance across different hospitals and patient populations. Specifically, the research focuses on the following aspects:

1. **Generalization evaluation**: Analyze the performance of a large language model named ClinicLLM on 30-day all-cause readmission prediction, especially how its performance differs across hospitals and patient populations (grouped by insurance type, race, age group, and comorbidity level).
2. **Reasons for lack of generalization**: Explore the factors that affect model generalization using descriptive statistics and supervised learning, including the sample size used for fine-tuning, the content of clinical notes (the number of words in each note), patient characteristics (comorbidity level, age, insurance type, residential area), and health system characteristics (hospital, 30-day all-cause readmission rate, and mortality rate).
3. **Strategies for improving generalization**: Three fine-tuning methods were tested to improve the model's generalization:
   - **Local fine-tuning (hospital-specific)**: Fine-tune independently for each hospital.
   - **Instance-based augmented fine-tuning**: Fine-tune after augmenting the data with similar notes from other hospitals.
   - **Cluster-based fine-tuning**: Cluster patient notes and then fine-tune, to capture patient populations with similar characteristics.

### Research background

In recent years, the application of large language models (LLMs) in healthcare has made significant progress, with uses such as improving patient care, providing clinical decision support, and streamlining the workflows of physicians and administrators.
However, the effectiveness and reliability of these models depend on their ability to maintain consistent performance across clinical settings and patient populations. This challenge was often underestimated in early development, so models performed worse in practice than expected.

### Main findings

- **Generalization differences across hospitals and patient groups**: ClinicLLM generalizes poorly in hospitals with small sample sizes (such as Hospital 3 and Hospital 4), and among patients with government insurance, patients with unspecified insurance, elderly patients, and patients with high comorbidity.
- **Factors affecting generalization**: Besides sample size, patient age, the number of comorbidities, and the number of words in the notes are all important factors affecting generalization.
- **Effectiveness of improvement strategies**: Among the three fine-tuning methods, local fine-tuning (hospital-specific) is the most effective, increasing AUC by 0.25% to 11.74%, with the largest gains in settings with limited data.

### Conclusion

This study provides new insights for deploying large language models in healthcare and proposes specific measures for improving model performance across broader populations. This helps improve the accuracy and reliability of the models, and it matters for the quality and efficiency of medical services.
### Formula representation

The main formula in this paper is the proportional change in AUC, used to measure the effectiveness of the different fine-tuning methods:

\[
\text{Proportional AUC Change} = \frac{\text{AUC}_{\text{Local/Instance/Cluster}} - \text{AUC}_{\text{Base, Specific Hospital}}}{\text{AUC}_{\text{Base, Specific Hospital}}}
\]

For global fine-tuning, the proportional change in AUC is:

\[
\text{Proportional AUC Change} = \frac{\text{AUC}_{\text{Global}} - \text{AUC}_{\text{Base, Global}}}{\text{AUC}_{\text{Base, Global}}}
\]

where \(\text{AUC}_{\text{Base, Global}}\) is the AUC value of global training, covering the entire data set without considering any specific groups.
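The proportional AUC change can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `roc_auc` and `proportional_auc_change` are hypothetical, the AUC is computed with a simple pairwise (rank-based) definition rather than whatever library the paper used, and the labels and scores are toy values invented purely for illustration.

```python
def roc_auc(labels, scores):
    """Pairwise AUC: the fraction of (positive, negative) pairs in which the
    positive example gets the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def proportional_auc_change(auc_tuned, auc_base):
    """(AUC_tuned - AUC_base) / AUC_base, following the formula above."""
    return (auc_tuned - auc_base) / auc_base


# Toy data for one hypothetical hospital (1 = readmitted within 30 days).
labels = [1, 0, 1, 0]
base_scores = [0.6, 0.7, 0.8, 0.2]   # assumed scores from the base model
tuned_scores = [0.9, 0.7, 0.8, 0.2]  # assumed scores after local fine-tuning

auc_base = roc_auc(labels, base_scores)
auc_tuned = roc_auc(labels, tuned_scores)
change = proportional_auc_change(auc_tuned, auc_base)
print(f"base AUC={auc_base:.2f}, tuned AUC={auc_tuned:.2f}, change={change:.2%}")
```

Because the change is divided by the baseline AUC, the same absolute gain counts for more at a hospital whose baseline is low, which is worth keeping in mind when comparing the reported percentages across hospitals.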

Generalization in Healthcare AI: Evaluation of a Clinical Large Language Model
