Abstract:

Background: Large language models show promise for improving radiology workflows, but their performance on structured radiological tasks such as Reporting and Data Systems (RADS) categorization remains unexplored.

Objective: This study aims to evaluate 3 large language model chatbots (Claude-2, GPT-3.5, and GPT-4) on assigning RADS categories to radiology reports and to assess the impact of different prompting strategies.

Methods: This cross-sectional study compared the 3 chatbots using 30 radiology reports (10 per RADS criterion) and a 3-level prompting strategy: zero-shot prompts, few-shot prompts, and guideline PDF-informed prompts. The cases, meticulously prepared by board-certified radiologists, were grounded in Liver Imaging Reporting & Data System (LI-RADS) version 2018, Lung CT (computed tomography) Screening Reporting & Data System (Lung-RADS) version 2022, and Ovarian-Adnexal Reporting & Data System (O-RADS) magnetic resonance imaging. Each report underwent 6 assessments. Two blinded reviewers assessed the chatbots' responses for patient-level RADS categorization and overall rating. Agreement across repetitions was assessed using Fleiss κ.

Results: Claude-2 achieved the highest accuracy in overall ratings with few-shot prompts and guideline PDFs (prompt-2), attaining 57% (17/30) average accuracy over 6 runs and 50% (15/30) accuracy with k-pass voting. Without prompt engineering, all chatbots performed poorly. The introduction of a structured exemplar prompt (prompt-1) increased the accuracy of overall ratings for all chatbots. Providing prompt-2 further improved Claude-2's performance, an enhancement not replicated by GPT-4. The interrun agreement was substantial for Claude-2 (κ=0.66 for overall rating and κ=0.69 for RADS categorization), fair for GPT-4 (κ=0.39 for both), and fair for GPT-3.5 (κ=0.21 for overall rating and κ=0.39 for RADS categorization). All chatbots showed significantly higher accuracy with LI-RADS version 2018 than with Lung-RADS version 2022 and O-RADS (P<.05); with prompt-2, Claude-2 achieved the highest overall rating accuracy of 75% (45/60) in LI-RADS version 2018.

Conclusions: When equipped with structured prompts and guideline PDFs, Claude-2 demonstrated potential in assigning RADS categories to radiology cases according to established criteria such as LI-RADS version 2018. However, the current generation of chatbots lags in accurately categorizing cases based on more recent RADS criteria.
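The two run-level statistics named in the abstract, Fleiss κ across repetitions and accuracy under k-pass voting, can be illustrated with a minimal Python sketch. This is not the study's code: the function names, the example labels, and the reading of "k-pass voting" as a simple majority vote over the repeated runs are all assumptions made for illustration.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss kappa for repeated runs treated as raters.

    ratings: one list per report, holding one category label per run.
    categories: every label that may appear (hypothetical label set).
    """
    n_items = len(ratings)
    n_runs = len(ratings[0])
    # counts[i][j] = number of runs that gave report i the j-th category
    counts = []
    for row in ratings:
        tally = Counter(row)
        counts.append([tally[c] for c in categories])
    # Observed agreement per report, then averaged over reports
    p_i = [
        (sum(n * n for n in row) - n_runs) / (n_runs * (n_runs - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items
    # Chance agreement from the overall category proportions
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_runs)
        for j in range(len(categories))
    ]
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        # Assumption: every run used one single category; report 1.0
        return 1.0
    return (p_bar - p_e) / (1.0 - p_e)


def majority_vote(labels):
    """Most frequent label across runs (assumed meaning of k-pass voting).

    Ties fall to the label seen first; that tie rule is an assumption.
    """
    return Counter(labels).most_common(1)[0][0]


def voted_accuracy(runs_per_report, reference):
    """Share of reports whose voted label matches the reference label."""
    votes = [majority_vote(runs) for runs in runs_per_report]
    hits = sum(v == r for v, r in zip(votes, reference))
    return hits / len(reference)


# Hypothetical labels for 2 reports x 6 runs, for illustration only
example_runs = [
    ["LR-5", "LR-5", "LR-4", "LR-5", "LR-5", "LR-5"],
    ["LR-3", "LR-4", "LR-4", "LR-3", "LR-4", "LR-4"],
]
example_reference = ["LR-5", "LR-3"]
kappa = fleiss_kappa(example_runs, ["LR-3", "LR-4", "LR-5"])
accuracy = voted_accuracy(example_runs, example_reference)
```

Treating each of the 6 runs as a "rater" is what lets Fleiss κ measure interrun consistency separately from correctness: a chatbot can agree with itself on every run and still be wrong, which is why the abstract reports κ alongside accuracy.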
