The current status of large language models in summarizing radiology report impressions

Danqing Hu, Shanyuan Zhang, Qing Liu, Xiaofeng Zhu, Bing Liu
2024-06-04
Abstract: Large language models (LLMs) like ChatGPT show excellent capabilities in various natural language processing tasks, especially text generation. However, the effectiveness of LLMs in summarizing radiology report impressions remains unclear. In this study, we explore the capability of eight LLMs on radiology report impression summarization. Three types of radiology reports, i.e., CT, PET-CT, and ultrasound reports, are collected from Peking University Cancer Hospital and Institute. We use the report findings to construct zero-shot, one-shot, and three-shot prompts with complete example reports to generate the impressions. Besides automatic quantitative evaluation metrics, we define five human evaluation metrics, i.e., completeness, correctness, conciseness, verisimilitude, and replaceability, to evaluate the semantics of the generated impressions. Two thoracic surgeons (ZSY and LB) and one radiologist (LQ) compare the generated impressions with the reference impressions and score each impression under the five human evaluation metrics. Experimental results show that there is a gap between the generated impressions and the reference impressions. Although the LLMs achieve comparable performance in completeness and correctness, the conciseness and verisimilitude scores are not very high. Using few-shot prompts can improve the LLMs' performance in conciseness and verisimilitude, but the clinicians still think the LLMs cannot replace the radiologists in summarizing the radiology impressions.
Computation and Language
What problem does this paper attempt to address?
This paper examines how well large language models (LLMs) summarize radiology report impressions. The study tested eight LLMs, both commercial and open-source, on Chinese CT, PET-CT, and ultrasound reports using zero-shot and few-shot (one-shot and three-shot) prompts. The generated impressions were evaluated with automatic quantitative metrics and with five human evaluation metrics scored by clinical experts: completeness, correctness, conciseness, verisimilitude, and replaceability. The paper found that although the LLMs achieved comparable performance in completeness and correctness, they did not score well in conciseness and verisimilitude. Few-shot prompts improved conciseness and verisimilitude, but the LLMs still could not replace radiologists in impression summarization. No single model performed best on all tasks, and commercial LLMs were generally superior to open-source LLMs. Performance also varied across report types, with Tongyi Qianwen performing best on PET-CT reports and ERNIE Bot performing best on CT reports. In conclusion, the paper aims to clarify the current status of LLMs in summarizing Chinese radiology report impressions and to highlight the gaps that remain before practical clinical application.
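To illustrate the zero-shot, one-shot, and three-shot setups described above, the following is a minimal Python sketch of how such prompts could be assembled from report findings and complete example reports. It is not the authors' implementation: the instruction wording, the `Report` class, the `build_prompt` function, and the placeholder report texts are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical instruction text; the paper's actual prompt wording is not given here.
INSTRUCTION = "Summarize the impression from the following radiology report findings."


@dataclass
class Report:
    """A complete example report (field names are illustrative assumptions)."""
    findings: str
    impression: str


def build_prompt(findings, examples=()):
    """Build a k-shot prompt, where k = len(examples); k = 0 gives a zero-shot prompt."""
    parts = [INSTRUCTION]
    # Each example shows the model one full findings/impression pair.
    for i, ex in enumerate(examples, start=1):
        parts.append(
            f"Example {i}\nFindings: {ex.findings}\nImpression: {ex.impression}"
        )
    # The target report ends with an open "Impression:" for the model to complete.
    parts.append(f"Findings: {findings}\nImpression:")
    return "\n\n".join(parts)


# Placeholder texts only, not real clinical data.
pool = [Report(f"example findings {n}", f"example impression {n}") for n in range(1, 4)]

zero_shot = build_prompt("target findings")
one_shot = build_prompt("target findings", pool[:1])
three_shot = build_prompt("target findings", pool)
```

The only difference between the three settings in this sketch is the number of complete example reports placed before the target findings, which makes it straightforward to compare how additional examples affect the conciseness and style of the generated impression.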