Adapted Large Language Models Can Outperform Medical Experts in Clinical Text Summarization

Dave Van Veen, Cara Van Uden, Louis Blankemeier, Jean-Benoit Delbrouck, Asad Aali, Christian Bluethgen, Anuj Pareek, Malgorzata Polacin, Eduardo Pontes Reis, Anna Seehofnerova, Nidhi Rohatgi, Poonam Hosamani, William Collins, Neera Ahuja, Curtis P. Langlotz, Jason Hom, Sergios Gatidis, John Pauly, Akshay S. Chaudhari
DOI: https://doi.org/10.1038/s41591-024-02855-5
2024-04-12
Abstract: Analyzing vast textual data and summarizing key information from electronic health records imposes a substantial burden on how clinicians allocate their time. Although large language models (LLMs) have shown promise in natural language processing (NLP), their effectiveness on a diverse range of clinical summarization tasks remains unproven. In this study, we apply adaptation methods to eight LLMs, spanning four distinct clinical summarization tasks: radiology reports, patient questions, progress notes, and doctor-patient dialogue. Quantitative assessments with syntactic, semantic, and conceptual NLP metrics reveal trade-offs between models and adaptation methods. A clinical reader study with ten physicians evaluates summary completeness, correctness, and conciseness; in a majority of cases, summaries from our best adapted LLMs are either equivalent (45%) or superior (36%) compared to summaries from medical experts. The ensuing safety analysis highlights challenges faced by both LLMs and medical experts, as we connect errors to potential medical harm and categorize types of fabricated information. Our research provides evidence of LLMs outperforming medical experts in clinical text summarization across multiple tasks. This suggests that integrating LLMs into clinical workflows could alleviate documentation burden, allowing clinicians to focus more on patient care.
Computation and Language
What problem does this paper attempt to address?
This paper investigates whether adapted large language models (LLMs) can match or exceed medical experts at clinical text summarization. The authors apply adaptation methods to eight LLMs across four summarization tasks: radiology reports, patient questions, progress notes, and doctor-patient dialogue. They evaluate the output quantitatively with syntactic, semantic, and conceptual NLP metrics, and through a clinical reader study in which ten physicians rate summary completeness, correctness, and conciseness. In the majority of cases, summaries from the best adapted LLMs are judged equivalent (45%) or superior (36%) to those written by medical experts. A safety analysis connects errors made by both LLMs and medical experts to potential medical harm and categorizes the types of fabricated information. The findings suggest that integrating LLMs into clinical workflows could alleviate documentation burden, allowing clinicians to focus more on patient care.