Abstract:

Background: The specialization and complexity of radiology make the automatic generation of radiologic impressions (ie, a diagnosis with differential diagnosis and management recommendations) challenging.

Purpose: To develop a large language model (LLM) that generates impressions based on imaging findings and to evaluate its performance in professional and linguistic dimensions.

Materials and Methods: Six radiologists recorded imaging examination findings from August 2 to 31, 2023, at Shanghai General Hospital and used the developed LLM before routinely writing report impressions for multiple radiologic modalities (CT, MRI, radiography, mammography) and anatomic sites (cranium and face, neck, chest, upper abdomen, lower abdomen, vessels, bone and joint, spine, breast), making necessary corrections and completing the radiologic impression. A subset was defined to investigate cases where the LLM-generated impressions differed from the final radiologist impressions by excluding identical and highly similar cases. An expert panel scored the LLM-generated impressions on a five-point Likert scale (5 = strongly agree) based on scientific terminology, coherence, specific diagnosis, differential diagnosis, management recommendations, correctness, comprehensiveness, harmlessness, and lack of bias.

Results: In this retrospective study, an LLM was pretrained using 20 GB of medical and general-purpose text data. The fine-tuning data set comprised 1.5 GB of data, including 800 radiology reports with paired instructions (describing the output task in natural language) and outputs. Test set 2 included data from 3988 patients (median age, 56 years [IQR, 40-68 years]; 2159 male). The median recall, precision, and F1 score of LLM-generated impressions were 0.775 (IQR, 0.56-1), 0.84 (IQR, 0.611-1), and 0.772 (IQR, 0.578-0.957), respectively, using the final impressions as the reference standard. In a subset of 1014 patients (median age, 57 years [IQR, 42-69 years]; 528 male), the overall median expert panel score for LLM-generated impressions was 5 (IQR, 5-5), ranging from 4 (IQR, 3-5) to 5 (IQR, 5-5).

Conclusion: The developed LLM generated radiologic impressions that were professionally and linguistically appropriate for a full spectrum of radiology examinations.

© RSNA, 2024. Supplemental material is available for this article.
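
The recall, precision, and F1 values reported in the Results compare each LLM-generated impression with the radiologist's final impression, and are then summarized as a median with IQR. The abstract does not state the unit of comparison, so the following is only a minimal sketch that assumes lowercased whitespace tokens as that unit. The function names `overlap_scores` and `median_iqr` are hypothetical and are not taken from the article.

```python
from collections import Counter
from statistics import median, quantiles


def overlap_scores(generated: str, reference: str) -> tuple[float, float, float]:
    """Return (recall, precision, F1) from token overlap.

    Assumption: tokens are lowercased, whitespace-separated words.
    The study's actual matching unit is not given in the abstract.
    """
    gen = Counter(generated.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((gen & ref).values())  # size of the multiset intersection
    if overlap == 0:
        return 0.0, 0.0, 0.0
    recall = overlap / sum(ref.values())      # share of reference tokens recovered
    precision = overlap / sum(gen.values())   # share of generated tokens that match
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1


def median_iqr(values: list[float]) -> tuple[float, float, float]:
    """Return (median, first quartile, third quartile) of per-case scores.

    Requires at least two values; uses the inclusive quartile method.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return median(values), q1, q3
```

Under these assumptions, each case yields one triple from `overlap_scores`, and `median_iqr` would be applied separately to the list of recall values, the list of precision values, and the list of F1 values across all cases.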

Evaluating Large Language Models for Radiology Natural Language Processing

Evaluating large language models in medical applications: a survey

Multi-modal large language models in radiology: principles, applications, and potential

Constructing a Large Language Model to Generate Impressions from Findings in Radiology Reports

Improving Clinical Expertise in Large Language Models Using Electronic Medical Records

Radiology-GPT: A Large Language Model for Radiology

Understanding natural language: Potential application of large language models to ophthalmology

Large language models for structured reporting in radiology: past, present, and future

Advancing radiology practice and research: harnessing the potential of large language models amidst imperfections

A Survey of Large Language Models in Medicine: Progress, Application, and Challenge

A Survey on Evaluation of Large Language Models

The current status of large language models in summarizing radiology report impressions

Large Language Models for Disease Diagnosis: A Scoping Review

A Comprehensive Survey on Evaluating Large Language Model Applications in the Medical Industry

A Survey for Large Language Models in Biomedicine

Large Language Model Benchmarks in Medical Tasks

Large Language Models: A Guide for Radiologists

Large Language Models for Medicine: A Survey

Ophtha-LLaMA2: A Large Language Model for Ophthalmology