The current status of large language models in summarizing radiology report impressions

Danqing Hu, Shanyuan Zhang, Qing Liu, Xiaofeng Zhu, Bing Liu

2024-06-04

Abstract: Large language models (LLMs) like ChatGPT show excellent capabilities in various natural language processing tasks, especially text generation. However, their effectiveness in summarizing radiology report impressions remains unclear. In this study, we explore the capability of eight LLMs in radiology report impression summarization. Three types of radiology reports, i.e., CT, PET-CT, and ultrasound reports, are collected from Peking University Cancer Hospital and Institute. We use the report findings to construct zero-shot, one-shot, and three-shot prompts with complete example reports to generate the impressions. Besides automatic quantitative evaluation metrics, we define five human evaluation metrics, i.e., completeness, correctness, conciseness, verisimilitude, and replaceability, to evaluate the semantics of the generated impressions. Two thoracic surgeons (ZSY and LB) and one radiologist (LQ) compare the generated impressions with the reference impressions and score each impression under the five human evaluation metrics. Experimental results show that there is a gap between the generated impressions and the reference impressions. Although the LLMs achieve comparable performance in completeness and correctness, their conciseness and verisimilitude scores are not very high. Using few-shot prompts can improve the LLMs' performance in conciseness and verisimilitude, but the clinicians still think the LLMs cannot replace radiologists in summarizing radiology impressions.

Computation and Language

What problem does this paper attempt to address?

This paper examines how well large language models (LLMs) can summarize radiology report impressions. The study tested eight LLMs, including commercial and open-source models, on Chinese CT, PET-CT, and ultrasound reports using zero-shot and few-shot (one-shot and three-shot) prompts. The generated impressions were assessed with automatic quantitative metrics and with manual scoring by clinical experts on five criteria: completeness, correctness, conciseness, verisimilitude, and replaceability. The paper found that although the LLMs achieved comparable performance in completeness and correctness, they did not score well in conciseness and verisimilitude. Few-shot prompts improved conciseness and verisimilitude, but the LLMs still could not fully replace radiologists in impression summarization. No single model performed best on all tasks, and commercial LLMs were generally superior to open-source LLMs. Performance also varied by report type, with Tongyi Qianwen performing best on PET-CT reports and ERNIE Bot performing best on CT reports. In conclusion, the paper aims to clarify the current status of LLMs in summarizing Chinese radiology report impressions and to highlight the gaps that remain before practical clinical application.
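To make the zero-shot, one-shot, and three-shot setup more concrete, here is a minimal Python sketch of how such prompts might be assembled from report findings and complete example reports. It is an illustration only: the instruction wording, the `ExampleReport` type, and the `build_prompt` function are hypothetical names and are not taken from the paper, whose exact prompt text is not given here.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ExampleReport:
    """A complete example report: findings plus its reference impression."""
    findings: str
    impression: str


# Hypothetical instruction wording; the paper's actual prompt text is not shown here.
INSTRUCTION = "Summarize the impression for the following radiology report findings."


def build_prompt(findings: str, examples: Sequence[ExampleReport] = ()) -> str:
    """Build a zero-shot prompt (no examples) or a k-shot prompt (k examples)."""
    parts: List[str] = [INSTRUCTION]
    # Each example contributes a full findings/impression pair.
    for i, ex in enumerate(examples, start=1):
        parts.append(
            f"Example {i}\nFindings: {ex.findings}\nImpression: {ex.impression}"
        )
    # The target report ends with an open "Impression:" slot for the model to fill.
    parts.append(f"Findings: {findings}\nImpression:")
    return "\n\n".join(parts)
```

Calling `build_prompt(findings)` with no examples gives the zero-shot prompt, while passing one or three `ExampleReport` objects gives the one-shot and three-shot variants. The resulting string would then be sent to whichever LLM is being evaluated.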

Constructing a Large Language Model to Generate Impressions from Findings in Radiology Reports

An Iterative Optimizing Framework for Radiology Report Summarization with ChatGPT

Evaluating Large Language Models for Radiology Natural Language Processing

Evaluation of large language models performance against humans for summarizing MRI knee radiology reports: A feasibility study

Improving Expert Radiology Report Summarization by Prompting Large Language Models with a Layperson Summary

Patient Centric Summarization of Radiology Findings using Large Language Models

AI-Assisted Summarization of Radiologic Reports: Evaluating GPT3davinci, BARTcnn, LongT5booksum, LEDbooksum, LEDlegal, and LEDclinical

A Comparative Study of Recent Large Language Models on Generating Hospital Discharge Summaries for Lung Cancer Patients

Large language models for structured reporting in radiology: past, present, and future

Learning to Generate Radiology Findings from Impressions Based on Large Language Model

LLM-RadJudge: Achieving Radiologist-Level Evaluation for X-Ray Report Generation

Evaluation of Radiology Residents' Reporting Skills Using Large Language Models: An Observational Study

EchoGPT: A Large Language Model for Echocardiography Report Summarization

Leveraging Professional Radiologists' Expertise to Enhance LLMs' Evaluation for Radiology Reports

Multilingual Natural Language Processing Model for Radiology Reports -- The Summary is all you need!

Large language models for structured reporting in radiology: performance of GPT-4, ChatGPT-3.5, Perplexity and Bing

Multi-modal large language models in radiology: principles, applications, and potential

An open-source fine-tuned large language model for radiological impression generation: a multi-reader performance study

Can large language models be new supportive tools in coronary computed tomography angiography reporting?

Translating musculoskeletal radiology reports into patient-friendly summaries using ChatGPT-4