Abstract:

Background: Although radiology reports are commonly used for lung cancer staging, this task can be challenging given radiologists' variable reporting styles as well as reports' potentially ambiguous and/or incomplete staging-related information.

Objective: To compare the performance of ChatGPT large language models (LLMs) and human readers of varying experience in lung cancer staging using chest CT and FDG PET/CT free-text reports.

Methods: This retrospective study included 700 patients (mean age, 73.8±29.5 years; 509 male, 191 female) from four institutions in Korea who underwent chest CT or FDG PET/CT for initial staging of non-small cell lung cancer from January 2020 to December 2023. The examination reports used a free-text format and were written either exclusively in English or in mixed English and Korean. Two thoracic radiologists in consensus determined the overall stage group (IA, IB, IIA, IIB, IIIA, IIIB, IIIC, IVA, IVB) for each report using the AJCC 8th-edition staging system, establishing the reference standard. Three ChatGPT models (GPT-4o, GPT-4, GPT-3.5) determined an overall stage group for each report using a script-based application programming interface, zero-shot learning, and a prompt incorporating a staging system summary. Six human readers (two fellowship-trained radiologists with less experience than the radiologists who determined the reference standard, two fellows, two residents) also independently determined overall stage groups. GPT-4o's overall accuracy for determining the correct stage among the nine groups was compared with that of the other LLMs and the human readers using McNemar tests.

Results: GPT-4o had an overall staging accuracy of 74.1%, significantly better than the accuracy of GPT-4 (70.1%, p=.02), GPT-3.5 (57.4%, p<.001), and resident 2 (65.7%, p<.001); significantly worse than the accuracy of fellowship-trained radiologist 1 (82.3%, p<.001) and fellowship-trained radiologist 2 (85.4%, p<.001); and not significantly different from the accuracy of fellow 1 (77.7%, p=.09), fellow 2 (75.6%, p=.53), and resident 1 (72.3%, p=.42).

Conclusions: The best-performing model, GPT-4o, showed no significant difference in staging accuracy versus fellows, but significantly worse performance versus fellowship-trained radiologists. The findings do not support use of LLMs for lung cancer staging in place of expert healthcare professionals.

Clinical Impact: The findings indicate the importance of domain expertise for performing complex specialized tasks such as cancer staging.
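The paired accuracy comparison named in Methods (McNemar tests against a shared reference standard) can be illustrated with a minimal Python sketch. The abstract does not say whether the exact or the chi-square form of the test was used; the exact binomial form is assumed here. The function names `discordant_counts` and `mcnemar_exact` and the toy labels are hypothetical and are not taken from the study.

```python
from math import comb
from typing import Sequence, Tuple


def discordant_counts(reference: Sequence[str],
                      reader_a: Sequence[str],
                      reader_b: Sequence[str]) -> Tuple[int, int]:
    """Count reports where exactly one of two readers matches the reference.

    Returns (a_only, b_only): reports staged correctly only by reader A,
    and reports staged correctly only by reader B.
    """
    if not (len(reference) == len(reader_a) == len(reader_b)):
        raise ValueError("all label sequences must have the same length")
    a_only = b_only = 0
    for ref, a, b in zip(reference, reader_a, reader_b):
        a_ok, b_ok = (a == ref), (b == ref)
        if a_ok and not b_ok:
            a_only += 1
        elif b_ok and not a_ok:
            b_only += 1
    return a_only, b_only


def mcnemar_exact(a_only: int, b_only: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts.

    Under the null hypothesis, each discordant report is equally likely
    to favor either reader, so the smaller count follows Binomial(n, 0.5).
    """
    n = a_only + b_only
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(a_only, b_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


if __name__ == "__main__":
    # Hypothetical toy labels for illustration only, not study data.
    reference = ["IA", "IB", "IIB", "IIIA", "IVA", "IVB"]
    model = ["IA", "IB", "IIA", "IIIA", "IIIB", "IVB"]
    reader = ["IA", "IA", "IIB", "IIIA", "IVA", "IVB"]
    a_only, b_only = discordant_counts(reference, model, reader)
    print(a_only, b_only, mcnemar_exact(a_only, b_only))
```

Only the discordant reports enter the test, which is why two readers with similar overall accuracy can still differ significantly if their errors fall on different reports.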

Lung Cancer Staging Using Chest CT and FDG PET/CT Free-Text Reports: Comparison Among Three ChatGPT Large-Language Models and Six Human Readers of Varying Experience
