A Comparative Study of Recent Large Language Models on Generating Hospital Discharge Summaries for Lung Cancer Patients

Yiming Li, Fang Li, Kirk Roberts, Licong Cui, Cui Tao, Hua Xu
2024-11-06
Abstract: Generating discharge summaries is a crucial yet time-consuming task in clinical practice, essential for conveying pertinent patient information and facilitating continuity of care. Recent advancements in large language models (LLMs) have significantly enhanced their capability in understanding and summarizing complex medical texts. This research aims to explore how LLMs can alleviate the burden of manual summarization, streamline workflow efficiencies, and support informed decision-making in healthcare settings. Clinical notes from a cohort of 1,099 lung cancer patients were utilized, with a subset of 50 patients for testing purposes, and 102 patients used for model fine-tuning. This study evaluates the performance of multiple LLMs, including GPT-3.5, GPT-4, GPT-4o, and LLaMA 3 8b, in generating discharge summaries. Evaluation metrics included token-level analysis (BLEU, ROUGE-1, ROUGE-2, ROUGE-L) and semantic similarity scores between model-generated summaries and physician-written gold standards. LLaMA 3 8b was further tested on clinical notes of varying lengths to examine the stability of its performance. The study found notable variations in summarization capabilities among LLMs. GPT-4o and fine-tuned LLaMA 3 demonstrated superior token-level evaluation metrics, while LLaMA 3 consistently produced concise summaries across different input lengths. Semantic similarity scores indicated GPT-4o and LLaMA 3 as leading models in capturing clinical relevance. This study contributes insights into the efficacy of LLMs for generating discharge summaries, highlighting LLaMA 3's robust performance in maintaining clarity and relevance across varying clinical contexts. These findings underscore the potential of automated summarization tools to enhance documentation precision and efficiency, ultimately improving patient care and operational capability in healthcare settings.
Computation and Language
What problem does this paper attempt to address?
The main problem this paper addresses is the automatic generation of discharge summaries for lung cancer patients, with the goal of reducing clinicians' manual summarization burden, improving workflow efficiency, and supporting medical decision-making. Specifically, the study explores how effectively large language models (LLMs) generate discharge summaries and assesses their ability to handle complex medical texts by comparing the performance of several LLMs.

### Research Background

- **Importance of Discharge Summaries**: Writing discharge summaries is a crucial but time-consuming task in clinical practice; the summaries are essential for communicating patient information and ensuring continuity of care.
- **Development of Large Language Models**: Recent LLMs (such as GPT-4 and LLaMA 3) have made significant progress in understanding and summarizing complex medical texts, which makes them candidates for generating discharge summaries automatically.

### Research Objectives

- **Evaluating the Performance of Different LLMs**: The study compares multiple LLMs (GPT-3.5, GPT-4, GPT-4o, and LLaMA 3 8b) on generating discharge summaries for lung cancer patients, assessing their strengths and weaknesses with token-level metrics and semantic similarity scores.
- **Exploring the Stability of LLMs under Different Input Lengths**: LLaMA 3 8b is additionally tested on clinical notes of varying lengths to examine how stable its performance remains.

### Research Methods

- **Dataset**: The study uses clinical notes from a cohort of 1,099 lung cancer patients; the records of 50 patients are used for testing and the records of 102 patients for model fine-tuning.
- **Evaluation Metrics**: Token-level metrics (BLEU, ROUGE-1, ROUGE-2, ROUGE-L) and semantic similarity scores, both computed between model-generated summaries and physician-written gold standards.
- **Experimental Setup**: The experiments use pre-trained GPT models and fine-tuned LLaMA 3 models, generating discharge summaries through specific prompting strategies.

### Main Findings

- **Token-Level Evaluation**: GPT-4o and fine-tuned LLaMA 3 achieve the best token-level scores (BLEU, ROUGE-1, ROUGE-2, and ROUGE-L).
- **Consistent Conciseness across Input Lengths**: LLaMA 3 produces consistently concise summaries for clinical notes of different lengths, which suggests stable behavior as input length varies.
- **Semantic Similarity Scores**: GPT-4o and LLaMA 3 are the leading models in semantic similarity, indicating that they capture clinical relevance most effectively.

### Conclusions

- **Research Contributions**: The study provides insights into the efficacy of LLMs for generating discharge summaries, in particular the robust performance of LLaMA 3 in maintaining clarity and relevance across varying clinical contexts.
- **Potential Impact**: Automated summarization tools could improve the precision and efficiency of clinical documentation and, ultimately, patient care and the operational capability of healthcare settings.

### Future Directions

- **Further Optimizing the Model**: Improve model performance in complex medical scenarios through richer training data and more refined fine-tuning strategies.
- **Multi-modal Data Fusion**: Explore combining other data types, such as images and laboratory results, to generate more comprehensive discharge summaries.
- **Promoting Clinical Applications**: Apply the results in real medical environments and evaluate their effectiveness and acceptance in real-world clinical workflows.
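
To make the token-level metrics named under Evaluation Metrics more concrete, the following is a minimal sketch of ROUGE-N and ROUGE-L F1 scores between a generated summary and a reference summary. It is an illustration only, not the paper's evaluation pipeline: the function names (`rouge_n`, `rouge_l`, `lcs_length`), the lowercase whitespace tokenization, the absence of stemming, and the two example sentences are all assumptions made here.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, cand_total, ref_total):
    """Combine match counts into an F1 score; 0.0 when nothing matches."""
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n=1):
    """ROUGE-N F1: clipped n-gram overlap between candidate and reference."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # Counter & keeps the minimum counts
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, via row-by-row DP."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L F1: based on the longest common subsequence of tokens."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    return f1(lcs_length(cand, ref), len(cand), len(ref))


# Hypothetical sentences, invented for illustration (not from the study data).
reference = "patient discharged home in stable condition"
candidate = "patient was discharged in stable condition"

print(f"ROUGE-1: {rouge_n(candidate, reference, 1):.3f}")  # 0.833
print(f"ROUGE-2: {rouge_n(candidate, reference, 2):.3f}")  # 0.400
print(f"ROUGE-L: {rouge_l(candidate, reference):.3f}")     # 0.833
```

In this example five of six unigrams match but only two of five bigrams do, which is why ROUGE-2 is typically much lower than ROUGE-1 for paraphrased summaries. Published results are usually computed with an established implementation that also handles stemming and tokenization details, so scores from this sketch would not be directly comparable to reported numbers.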