Abstract:

Purpose: To assess the performance of a local open-source large language model (LLM) in various information extraction tasks from real-life emergency brain MRI reports.

Materials and Methods: All consecutive emergency brain MRI reports written in 2022 from a French quaternary center were retrospectively reviewed. Two radiologists identified MRI scans that were performed in the emergency department for headaches. Four radiologists scored the reports' conclusions as either normal or abnormal. Abnormalities were labeled as either headache-causing or incidental. Vicuna (LMSYS Org), an open-source LLM, performed the same tasks. Vicuna's performance metrics were evaluated using the radiologists' consensus as the reference standard.

Results: Among the 2398 reports during the study period, radiologists identified 595 that included headaches in the indication (median age of patients, 35 years [IQR, 26-51 years]; 68% [403 of 595] women). A positive finding was reported in 227 of 595 (38%) cases, 136 of which could explain the headache. The LLM had a sensitivity of 98.0% (95% CI: 96.5, 99.0) and specificity of 99.3% (95% CI: 98.8, 99.7) for detecting the presence of headache in the clinical context, a sensitivity of 99.4% (95% CI: 98.3, 99.9) and specificity of 98.6% (95% CI: 92.2, 100.0) for the use of contrast medium injection, a sensitivity of 96.0% (95% CI: 92.5, 98.2) and specificity of 98.9% (95% CI: 97.2, 99.7) for study categorization as either normal or abnormal, and a sensitivity of 88.2% (95% CI: 81.6, 93.1) and specificity of 73% (95% CI: 62, 81) for causal inference between MRI findings and headache.

Conclusion: An open-source LLM was able to extract information from free-text radiology reports with excellent accuracy without requiring further training.

Keywords: Large Language Model (LLM), Generative Pretrained Transformers (GPT), Open Source, Information Extraction, Report, Brain, MRI

Supplemental material is available for this article.
Published under a CC BY 4.0 license. See also the commentary by Akinci D'Antonoli and Bluethgen in this issue.
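
The sensitivity and specificity figures in the Results are proportions taken from a 2x2 table of model output against the radiologists' consensus. The abstract gives neither the underlying counts nor the method used for the confidence intervals, so the following is only a minimal sketch: the counts passed to `sensitivity_specificity` are hypothetical, and the Wilson score interval is an assumed choice, not necessarily the authors' method. Only the two percentage checks at the end use numbers from the abstract.

```python
from math import sqrt


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 gives an approximate 95% interval. This is an assumed method;
    the abstract does not say how its intervals were computed.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def sensitivity_specificity(tp, fn, tn, fp):
    """Return point estimates and Wilson intervals from confusion-matrix counts."""
    sens = tp / (tp + fn)  # true positives among all reference positives
    spec = tn / (tn + fp)  # true negatives among all reference negatives
    return {
        "sensitivity": (sens, wilson_interval(tp, tp + fn)),
        "specificity": (spec, wilson_interval(tn, tn + fp)),
    }


# Hypothetical counts for illustration only (not taken from the study).
example = sensitivity_specificity(tp=90, fn=10, tn=180, fp=20)

# Proportions that the abstract does report, recomputed from its counts.
abnormal_pct = round(100 * 227 / 595)  # reported as 38%
women_pct = round(100 * 403 / 595)     # reported as 68%
```

With the study's actual confusion-matrix counts, which may be available in the full article or its supplemental material, the same function could be applied to each of the four extraction tasks.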

Comparing Commercial and Open-Source Large Language Models for Labeling Chest Radiograph Reports

Is Open-Source There Yet? A Comparative Study on Commercial and Open-Source LLMs in Their Ability to Label Chest X-Ray Reports

Assessing Large Language Models for Oncology Data Inference from Radiology Reports

Fine-Tuning In-House Large Language Models to Infer Differential Diagnosis from Radiology Reports

Evaluating Large Language Models for Radiology Natural Language Processing

[Large language models from OpenAI, Google, Meta, X and Co. : The role of "closed" and "open" models in radiology]

Large language models for structured reporting in radiology: performance of GPT-4, ChatGPT-3.5, Perplexity and Bing

An open-source fine-tuned large language model for radiological impression generation: a multi-reader performance study

Harnessing Large Language Models for Structured Reporting in Breast Ultrasound: A Comparative Study of Open AI (GPT-4.0) and Microsoft Bing (GPT-4)

Performance of Open-Source LLMs in Challenging Radiological Cases — A Benchmark Study on 1,933 Eurorad Case Reports

General-Purpose vs. Domain-Adapted Large Language Models for Extraction of Structured Data from Chest Radiology Reports

Radiology-GPT: A Large Language Model for Radiology

Performance of an Open-Source Large Language Model in Extracting Information from Free-Text Radiology Reports

Evaluating Large Language Models for Public Health Classification and Extraction Tasks

The Diagnostic Performance of Large Language Models and General Radiologists in Thoracic Radiology Cases: A Comparative Study

Automated Spinal MRI Labelling from Reports Using a Large Language Model

Exploring the Boundaries of GPT-4 in Radiology

A comparative study of large language model-based zero-shot inference and task-specific supervised classification of breast cancer pathology reports

LLM-RadJudge: Achieving Radiologist-Level Evaluation for X-Ray Report Generation

XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models

Comparative Analysis of GPT-4Vision, GPT-4 and Open Source LLMs in Clinical Diagnostic Accuracy: A Benchmark Against Human Expertise