Syringoma resembling confluent and reticulated papillomatosis of Gougerot-Carteaud.

M. W. Lee,L. Goldberg,K. Dorsey,S. Baer

Abstract:We report the case of a 31-year-old woman with a rare presentation of syringoma resembling confluent and reticulated papillomatosis of Gougerot-Carteaud. The lesions have been unresponsive to treatment with topical steroids and retinoic acid.

Large language models for structured reporting in radiology: performance of GPT-4, ChatGPT-3.5, Perplexity and Bing

Carlo A. Mallio,Andrea C. Sertorio,Caterina Bernetti,Bruno Beomonte Zobel

DOI: https://doi.org/10.1007/s11547-023-01651-4

2023-05-30

Abstract:Structured reporting may improve the radiological workflow and communication among physicians. Artificial intelligence applications in medicine are growing fast. Large language models (LLMs) have recently gained importance as valuable tools in radiology and are currently being tested for the critical task of structured reporting. We compared four LLMs in terms of their knowledge of structured reporting and their template proposals. LLMs hold great potential for generating structured reports in radiology, but additional formal validation is needed on this topic.
Evaluation of large language models in breast cancer clinical scenarios: A comparative analysis based on ChatGPT-3.5, ChatGPT-4.0, and Claude2

Linfang Deng,Tianyi Wang,Yangzhang,Zhenhua Zhai,Wei Tao,Jincheng Li,Yi Zhao,Jinjiang Xu,Shaoting Luo

DOI: https://doi.org/10.1097/js9.0000000000001066

2024-01-23

International Journal of Surgery

Abstract:Background: Large language models (LLMs) have garnered significant attention in the AI domain owing to their exemplary context recognition and response capabilities. However, the potential of LLMs in specific clinical scenarios, particularly in breast cancer diagnosis, treatment, and care, has not been fully explored. This study aimed to compare the performances of three major LLMs in the clinical context of breast cancer. Methods: In this study, clinical scenarios designed specifically for breast cancer were segmented into five pivotal domains (nine cases): assessment and diagnosis, treatment decision-making, post-operative care, psychosocial support, and prognosis and rehabilitation. The LLMs were used to generate feedback for various queries related to these domains. For each scenario, a panel of five breast cancer specialists, each with over a decade of experience, evaluated the feedback from LLMs. They assessed the feedback in terms of its quality, relevance, and applicability. Results: There was a moderate level of agreement among the raters (Fleiss’ kappa = 0.345, P < 0.05). Comparing the performance of different models regarding response length, GPT-4.0 and GPT-3.5 provided relatively longer feedback than Claude2. Furthermore, across the nine case analyses, GPT-4.0 significantly outperformed the other two models in average quality, relevance, and applicability. Within the five clinical areas, GPT-4.0 markedly surpassed GPT-3.5 in the quality of the other four areas and scored higher than Claude2 in tasks related to psychosocial support and treatment decision-making. Conclusion: This study revealed that in the realm of clinical applications for breast cancer, GPT-4.0 showcases not only superiority in terms of quality and relevance but also demonstrates exceptional capability in applicability, especially when compared to GPT-3.5. Relative to Claude2, GPT-4.0 holds advantages in specific domains. With the expanding use of LLMs in the clinical field, ongoing optimization and rigorous accuracy assessments are paramount.
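The inter-rater agreement statistic reported in this abstract, Fleiss' kappa, can be illustrated with a minimal sketch. The function name `fleiss_kappa` and the example scores are assumptions made for illustration; the study's own rating data and analysis code are not given in the abstract.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for items that were each rated by the same number of raters.

    `ratings` is a list of lists: ratings[i] holds the category label that
    each rater assigned to item i.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    counts = [Counter(item) for item in ratings]
    categories = set().union(*counts)

    # Observed agreement: mean over items of the share of agreeing rater pairs.
    p_items = [
        (sum(n * n for n in c.values()) - n_raters) / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(p_items) / n_items

    # Chance agreement: sum of squared overall category proportions.
    total = n_items * n_raters
    p_expected = sum((sum(c[cat] for c in counts) / total) ** 2 for cat in categories)

    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is ever used")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical example: five raters scoring three responses on a 1-5 scale.
example_scores = [[4, 4, 5, 4, 4], [3, 4, 3, 3, 2], [5, 5, 5, 5, 5]]
print(round(fleiss_kappa(example_scores), 3))
```

Kappa is 1 when all raters agree on every item and 0 when agreement is no better than chance.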

surgery
BI-RADS Category Assignments by GPT-3.5, GPT-4, and Google Bard: A Multilanguage Study

Andrea Cozzi,Katja Pinker,Andri Hidber,Tianyu Zhang,Luca Bonomo,Roberto Lo Gullo,Blake Christianson,Marco Curti,Stefania Rizzo,Filippo Del Grande,Ritse M. Mann,Simone Schiaffino,Ariane Panzer

DOI: https://doi.org/10.1148/radiol.232133

IF: 19.7

2024-05-01

Radiology

Abstract:Background: The performance of publicly available large language models (LLMs) remains unclear for complex clinical tasks. Purpose: To evaluate the agreement between human readers and LLMs for Breast Imaging Reporting and Data System (BI-RADS) categories assigned based on breast imaging reports written in three languages and to assess the impact of discordant category assignments on clinical management. Materials and Methods: This retrospective study included reports for women who underwent MRI,...

radiology, nuclear medicine & medical imaging
Evaluating Large Language Models for Radiology Natural Language Processing

Zhengliang Liu,Tianyang Zhong,Yiwei Li,Yutong Zhang,Yi Pan,Zihao Zhao,Peixin Dong,Chao Cao,Yuxiao Liu,Peng Shu,Yaonai Wei,Zihao Wu,Chong Ma,Jiaqi Wang,Sheng Wang,Mengyue Zhou,Zuowei Jiang,Chunlin Li,Jason Holmes,Shaochen Xu,Lu Zhang,Haixing Dai,Kai Zhang,Lin Zhao,Yuanhao Chen,Xu Liu,Peilong Wang,Pingkun Yan,Jun Liu,Bao Ge,Lichao Sun,Dajiang Zhu,Xiang Li,Wei Liu,Xiaoyan Cai,Xintao Hu,Xi Jiang,Shu Zhang,Xin Zhang,Tuo Zhang,Shijie Zhao,Quanzheng Li,Hongtu Zhu,Dinggang Shen,Tianming Liu

2023-07-27

Abstract:The rise of large language models (LLMs) has marked a pivotal shift in the field of natural language processing (NLP). LLMs have revolutionized a multitude of domains, and they have made a significant impact in the medical field. Large language models are now more abundant than ever, and many of these models exhibit bilingual capabilities, proficient in both English and Chinese. However, a comprehensive evaluation of these models remains to be conducted. This lack of assessment is especially apparent within the context of radiology NLP. This study seeks to bridge this gap by critically evaluating thirty two LLMs in interpreting radiology reports, a crucial component of radiology NLP. Specifically, the ability to derive impressions from radiologic findings is assessed. The outcomes of this evaluation provide key insights into the performance, strengths, and weaknesses of these LLMs, informing their practical applications within the medical domain.

Computation and Language
Assessing Large Language Models for Oncology Data Inference from Radiology Reports

Li-Ching Chen,Travis Zack,Arda Demirci,Madhumita Sushil,Brenda Miao,Corynn Kasap,Atul J Butte,Eric Collisson,Julian Hong

DOI: https://doi.org/10.1101/2024.05.23.24307579

2024-05-23

Abstract:Key Points Purpose: We examined the effectiveness of proprietary and open Large Language Models (LLMs) in detecting disease presence, location, and treatment response in pancreatic cancer from radiology reports. Methods: We analyzed 203 deidentified radiology reports, manually annotated for disease status, location, and indeterminate nodules needing follow-up. Utilizing GPT-4, GPT-3.5-turbo, and open models like Gemma-7B and Llama3-8B, we employed strategies such as ablation and prompt engineering to boost accuracy. Discrepancies between human and model interpretations were reviewed by a secondary oncologist. Results: Among 164 pancreatic adenocarcinoma patients, GPT-4 showed the highest accuracy in inferring disease status, achieving a 75.5% correctness (F1-micro). Open models Mistral-7B and Llama3-8B performed comparably, with accuracies of 68.6% and 61.4%, respectively. Mistral-7B excelled in deriving correct inferences from "Objective Findings" directly. Most tested models demonstrated proficiency in identifying disease-containing anatomical locations from a list of choices, with GPT-4 and Llama3-8B showing near parity in precision and recall for disease site identification. However, open models struggled with differentiating benign from malignant post-surgical changes, impacting their precision in identifying findings indeterminate for cancer. A secondary review occasionally favored GPT-3.5's interpretations, indicating the variability in human judgment. Conclusion: LLMs, especially GPT-4, are proficient in deriving oncological insights from radiology reports. Their performance is enhanced by effective summarization strategies, demonstrating their potential in clinical support and healthcare analytics. This study also underscores the possibility of zero-shot open model utility in environments where proprietary models are restricted. Finally, by providing a set of annotated radiology reports, this paper presents a valuable dataset for further LLM research in oncology.
Automatic Extraction of Imaging Observation and Assessment Categories from Breast Magnetic Resonance Imaging Reports with Natural Language Processing

Yi Liu,Li-Na Zhu,Qing Liu,Chao Han,Xiao-Dong Zhang,Xiao-Ying Wang

DOI: https://doi.org/10.1097/cm9.0000000000000301

IF: 6.133

2019-01-01

Chinese Medical Journal

Abstract:Background: Structured reports are not widely used and thus most reports exist in the form of free text. The process of data extraction by experts is time-consuming and error-prone, whereas data extraction by natural language processing (NLP) is a potential solution that could improve diagnosis efficiency and accuracy. The purpose of this study was to evaluate an NLP program that determines American College of Radiology Breast Imaging Reporting and Data System (BI-RADS) descriptors and final assessment categories from breast magnetic resonance imaging (MRI) reports. Methods: This cross-sectional study involved 2330 breast MRI reports in the electronic medical record from 2009 to 2017. We used 1635 reports for the creation of a revised BI-RADS MRI lexicon and synonyms lists as well as the iterative development of an NLP system. The remaining 695 reports that were not used for developing the system were used as an independent test set for the final evaluation of the NLP system. The recall and precision of an NLP algorithm to detect the revised BI-RADS MRI descriptors and BI-RADS categories from the free-text reports were evaluated against a standard reference of manual human review. Results: There was a high level of agreement between two manual reviewers, with a kappa value of 0.95. For all breast imaging reports, the NLP algorithm demonstrated a recall of 78.5% and a precision of 86.1% for correct identification of the revised BI-RADS MRI descriptors and the BI-RADS categories. NLP generated the total results in <1 s, whereas the manual reviewers averaged 3.38 and 3.23 min per report, respectively. Conclusions: The NLP algorithm demonstrates high recall and precision for information extraction from free-text reports. This approach will help to narrow the gap between unstructured report text and structured data, which is needed in decision support and other applications.
General-Purpose vs. Domain-Adapted Large Language Models for Extraction of Structured Data from Chest Radiology Reports

Ali H. Dhanaliwala,Rikhiya Ghosh,Sanjeev Kumar Karn,Poikavila Ullaskrishnan,Oladimeji Farri,Dorin Comaniciu,Charles E. Kahn

2024-04-09

Abstract:Radiologists produce unstructured data that can be valuable for clinical care when consumed by information systems. However, variability in style limits usage. This study compares a system using a domain-adapted language model (RadLing) and a general-purpose LLM (GPT-4) in extracting relevant features from chest radiology reports and standardizing them to common data elements (CDEs). Three radiologists annotated a retrospective dataset of 1399 chest XR reports (900 training, 499 test) and mapped them to 44 pre-selected relevant CDEs. The GPT-4 system was prompted with the report, feature set, value set, and dynamic few-shots to extract values and map them to CDEs. Output key:value pairs were compared to the reference standard at both stages, and an identical match was considered a true positive (TP). The F1 score for extraction was 97% for the RadLing-based system and 78% for the GPT-4 system. The F1 score for mapping was 98% for RadLing and 94% for GPT-4; the difference was statistically significant (P<.001). RadLing's domain-adapted embeddings were better in feature extraction, and its light-weight mapper had a better F1 score in CDE assignment. The RadLing system also demonstrated higher capabilities in differentiating between absent (99% vs 64%) and unspecified (99% vs 89%). The RadLing system's domain-adapted embeddings helped improve the performance of the GPT-4 system to 92% by giving more relevant few-shot prompts. The RadLing system offers operational advantages including local deployment and reduced runtime costs.
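The scoring rule described in this abstract, where an extracted key:value pair counts as a true positive only on an identical match with the reference, can be sketched as a micro-averaged F1 computation. The function name `micro_f1` and the CDE names and values below are hypothetical; the abstract does not specify how the authors aggregated scores across reports.

```python
def micro_f1(predictions, references):
    """Micro-averaged F1 over key:value pairs, using exact-match scoring.

    `predictions` and `references` are parallel lists of dicts, one per report.
    A predicted pair is a true positive only if the same key maps to the
    identical value in the reference for that report.
    """
    tp = fp = fn = 0
    for pred, ref in zip(predictions, references):
        pred_pairs = set(pred.items())
        ref_pairs = set(ref.items())
        tp += len(pred_pairs & ref_pairs)   # identical key and value
        fp += len(pred_pairs - ref_pairs)   # predicted but not in reference
        fn += len(ref_pairs - pred_pairs)   # in reference but missed or wrong
    denominator = 2 * tp + fp + fn
    # With nothing predicted and nothing expected, report 0.0 by convention here.
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical CDE names and values for a single chest radiograph report.
predicted = [{"pneumothorax": "absent", "pleural_effusion": "present",
              "cardiomegaly": "unspecified"}]
reference = [{"pneumothorax": "absent", "pleural_effusion": "absent",
              "cardiomegaly": "unspecified"}]
print(micro_f1(predicted, reference))
```

Under exact matching, a wrong value for a key is penalized twice: once as a false positive and once as a false negative.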

Computation and Language,Image and Video Processing
Development of a quantitative real-time PCR method to enumerate total bacterial counts in ready-to-eat fruits and vegetables.

Hajime Takahashi,H. Konuma,Y. Hara-Kudo

DOI: https://doi.org/10.4315/0362-028X-69.10.2504

IF: 2.745

2006-10-01

Journal of Food Protection

Abstract:A newly developed real-time PCR assay rapidly quantifies the total bacterial numbers in contaminated ready-to-eat vegetables and fruits compared with the standard plate count method. Primers targeting the rpoB gene, which encodes the beta subunit of the bacterial RNA polymerase and which is common to most bacterial species, were used instead of the 16S rRNA gene, which has multiple copies and varies among bacterial species. A primer pair specific for rpoB was confirmed to amplify rpoB in a wide range of bacterial species after we assessed 49 strains isolated from five kinds of fruits and vegetables. We purchased fruits and vegetables from retail shops, enumerated the bacteria associated with them by real-time PCR, and compared the results with the numbers found by the culture method. We found a high correlation between the PCR threshold cycle number and the plate count. The real-time PCR assay developed in this study can enumerate the dominant bacterial species in ready-to-eat fruits and vegetables.
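The relationship between threshold cycle and plate count mentioned in this abstract is commonly expressed as a standard curve: a straight-line fit of threshold cycle (Ct) against log10 of the count, which is then inverted to estimate counts for new samples. The sketch below illustrates that general idea only. The data points are synthetic, the function names are hypothetical, and nothing here is taken from the authors' measurements or procedure.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def log10_count_from_ct(ct, slope, intercept):
    """Invert the standard curve to estimate log10(count) from a Ct value."""
    return (ct - intercept) / slope


# Synthetic calibration points (NOT from the paper): log10 CFU/g against Ct.
log10_cfu = [3.0, 4.0, 5.0, 6.0, 7.0]
ct_values = [30.0, 26.7, 23.4, 20.1, 16.8]

slope, intercept = fit_line(log10_cfu, ct_values)
estimate = log10_count_from_ct(25.05, slope, intercept)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, log10 CFU/g={estimate:.2f}")
```

A slope near -3.32 corresponds to a doubling of product in every cycle, since a tenfold change in template shifts Ct by log2(10) cycles.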
Critical Care Studies Using Large Language Models Based on Electronic Healthcare Records: A Technical Note

Zhongheng Zhang,Hongying Ni

DOI: https://doi.org/10.1016/j.jointm.2024.09.002

2024-01-01

Journal of Intensive Medicine

Abstract:The integration of large language models (LLMs) in clinical medicine, particularly in critical care, has introduced transformative capabilities for analyzing and managing complex medical information. This technical note explores the application of LLMs, such as generative pretrained transformer 4 (GPT-4) and Qwen-Chat, in interpreting electronic healthcare records to assist with rapid patient condition assessments, predict sepsis, and automate the generation of discharge summaries. The note emphasizes the significance of LLMs in processing unstructured data from electronic health records (EHRs), extracting meaningful insights, and supporting personalized medicine through nuanced understanding of patient histories. Despite the technical complexity of deploying LLMs in clinical settings, this document provides a comprehensive guide to facilitate the effective integration of LLMs into clinical workflows, focusing on the use of DashScope's application programming interface (API) services for judgment on patient prognosis and organ support recommendations based on natural language in EHRs. By illustrating practical steps and best practices, this work aims to lower the technical barriers for clinicians and researchers, enabling broader adoption of LLMs in clinical research and practice to enhance patient care and outcomes.
A comparative study of large language model-based zero-shot inference and task-specific supervised classification of breast cancer pathology reports

Madhumita Sushil,Travis Zack,Divneet Mandair,Zhiwei Zheng,Ahmed Wali,Yan-Ning Yu,Yuwei Quan,Dmytro Lituiev,Atul J Butte

DOI: https://doi.org/10.1093/jamia/ocae146

2024-10-01

Abstract:Objective: Although supervised machine learning is popular for information extraction from clinical notes, creating large annotated datasets requires extensive domain expertise and is time-consuming. Meanwhile, large language models (LLMs) have demonstrated promising transfer learning capability. In this study, we explored whether recent LLMs could reduce the need for large-scale data annotations. Materials and methods: We curated a dataset of 769 breast cancer pathology reports, manually labeled with 12 categories, to compare zero-shot classification capability of the following LLMs: GPT-4, GPT-3.5, Starling, and ClinicalCamel, with task-specific supervised classification performance of 3 models: random forests, long short-term memory networks with attention (LSTM-Att), and the UCSF-BERT model. Results: Across all 12 tasks, the GPT-4 model performed either significantly better than or as well as the best supervised model, LSTM-Att (average macro F1-score of 0.86 vs 0.75), with advantage on tasks with high label imbalance. Other LLMs demonstrated poor performance. Frequent GPT-4 error categories included incorrect inferences from multiple samples and from history, and complex task design, and several LSTM-Att errors were related to poor generalization to the test set. Discussion: On tasks where large annotated datasets cannot be easily collected, LLMs can reduce the burden of data labeling. However, if the use of LLMs is prohibitive, the use of simpler models with large annotated datasets can provide comparable results. Conclusions: GPT-4 demonstrated the potential to speed up the execution of clinical NLP studies by reducing the need for large annotated datasets. This may increase the utilization of NLP-based variables and outcomes in clinical studies.
Using GPT‐4 for LI‐RADS feature extraction and categorization with multilingual free‐text reports

Kyowon Gu,Jeong Hyun Lee,Jaeseung Shin,Jeong Ah Hwang,Ji Hye Min,Woo Kyoung Jeong,Min Woo Lee,Kyoung Doo Song,Sung Hwan Bae

DOI: https://doi.org/10.1111/liv.15891

IF: 8.754

2024-04-25

Liver International

Abstract:Background and Aims: The Liver Imaging Reporting and Data System (LI‐RADS) offers a standardized approach for imaging hepatocellular carcinoma. However, the diverse styles and structures of radiology reports complicate automatic data extraction. Large language models hold the potential for structured data extraction from free‐text reports. Our objective was to evaluate the performance of Generative Pre‐trained Transformer (GPT)‐4 in extracting LI‐RADS features and categories from free‐text liver magnetic resonance imaging (MRI) reports. Methods: Three radiologists generated 160 fictitious free‐text liver MRI reports written in Korean and English, simulating real‐world practice. Of these, 20 were used for prompt engineering, and 140 formed the internal test cohort. Seventy‐two genuine reports, authored by 17 radiologists, were collected and de‐identified for the external test cohort. LI‐RADS features were extracted using GPT‐4, with a Python script calculating categories. Accuracies in each test cohort were compared. Results: On the external test, the accuracy for the extraction of major LI‐RADS features, which encompass size, nonrim arterial phase hyperenhancement, nonperipheral 'washout', enhancing 'capsule' and threshold growth, ranged from .92 to .99. For the rest of the LI‐RADS features, the accuracy ranged from .86 to .97. For the LI‐RADS category, the model showed an accuracy of .85 (95% CI: .76, .93). Conclusions: GPT‐4 shows promise in extracting LI‐RADS features, yet further refinement of its prompting strategy and advancements in its neural network architecture are crucial for reliable use in processing complex real‐world MRI reports.
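The abstract mentions that a Python script computed categories from the extracted features. The authors' script is not shown, so the following is only a simplified sketch of how such a rule-based step might look, assuming the major-feature diagnostic table of LI-RADS v2018 for CT/MRI. The class and function names are hypothetical. The sketch ignores LR-1, LR-2, LR-M, LR-TIV, ancillary features, and tie-breaking rules, and should be checked against the official ACR documentation before any use.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """Major features of one liver observation (hypothetical field names)."""
    size_mm: float
    nonrim_aphe: bool        # nonrim arterial phase hyperenhancement
    washout: bool            # nonperipheral "washout"
    capsule: bool            # enhancing "capsule"
    threshold_growth: bool


def lirads_category(obs: Observation) -> str:
    """Return LR-3, LR-4, or LR-5 from major features only (simplified)."""
    additional = sum([obs.washout, obs.capsule, obs.threshold_growth])

    if not obs.nonrim_aphe:
        if obs.size_mm < 20:
            return "LR-4" if additional >= 2 else "LR-3"
        return "LR-4" if additional >= 1 else "LR-3"

    # Nonrim APHE is present from here on.
    if obs.size_mm < 10:
        return "LR-4" if additional >= 1 else "LR-3"
    if obs.size_mm < 20:
        if additional == 0:
            return "LR-3"
        if additional >= 2:
            return "LR-5"
        # Exactly one additional feature: "capsule" alone gives LR-4,
        # while washout or threshold growth alone gives LR-5.
        return "LR-4" if obs.capsule else "LR-5"
    return "LR-5" if additional >= 1 else "LR-4"


print(lirads_category(Observation(25, True, True, False, False)))
```

Separating feature extraction (done by the language model) from category assignment (done by deterministic rules) keeps the final category reproducible for a given set of extracted features.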

gastroenterology & hepatology
Evaluating text and visual diagnostic capabilities of large language models on questions related to the Breast Imaging Reporting and Data System Atlas 5th edition

Yasin Celal Güneş,Turay Cesur,Eren Çamur,Leman Günbey Karabekmez

DOI: https://doi.org/10.4274/dir.2024.242876

2024-09-09

Abstract:Purpose: This study aimed to evaluate the performance of large language models (LLMs) and multimodal LLMs in interpreting the Breast Imaging Reporting and Data System (BI-RADS) categories and providing clinical management recommendations for breast radiology in text-based and visual questions. Methods: This cross-sectional observational study involved two steps. In the first step, we compared ten LLMs (namely ChatGPT 4o, ChatGPT 4, ChatGPT 3.5, Google Gemini 1.5 Pro, Google Gemini 1.0, Microsoft Copilot, Perplexity, Claude 3.5 Sonnet, Claude 3 Opus, and Claude 3 Opus 200K), general radiologists, and a breast radiologist using 100 text-based multiple-choice questions (MCQs) related to the BI-RADS Atlas 5th edition. In the second step, we assessed the performance of five multimodal LLMs (ChatGPT 4o, ChatGPT 4V, Claude 3.5 Sonnet, Claude 3 Opus, and Google Gemini 1.5 Pro) in assigning BI-RADS categories and providing clinical management recommendations on 100 breast ultrasound images. The comparison of correct answers and accuracy by question types was analyzed using McNemar's and chi-squared tests. Management scores were analyzed using the Kruskal-Wallis and Wilcoxon tests. Results: Claude 3.5 Sonnet achieved the highest accuracy in text-based MCQs (90%), followed by ChatGPT 4o (89%), outperforming all other LLMs and general radiologists (78% and 76%) (P < 0.05), except for the Claude 3 Opus models and the breast radiologist (82%) (P > 0.05). Lower-performing LLMs included Google Gemini 1.0 (61%) and ChatGPT 3.5 (60%). Performance across different categories showed no significant variation among LLMs or radiologists (P > 0.05). For breast ultrasound images, Claude 3.5 Sonnet achieved 59% accuracy, significantly higher than other multimodal LLMs (P < 0.05). Management recommendations were evaluated using a 3-point Likert scale, with Claude 3.5 Sonnet scoring the highest (mean: 2.12 ± 0.97) (P < 0.05). Accuracy varied significantly across BI-RADS categories, except Claude 3 Opus (P < 0.05). Gemini 1.5 Pro failed to answer any BI-RADS 5 questions correctly. Similarly, ChatGPT 4V failed to answer any BI-RADS 1 questions correctly, making them the least accurate in these categories (P < 0.05). Conclusion: Although LLMs such as Claude 3.5 Sonnet and ChatGPT 4o show promise in text-based BI-RADS assessments, their limitations in visual diagnostics suggest they should be used cautiously and under radiologists' supervision to avoid misdiagnoses. Clinical significance: This study demonstrates that while LLMs exhibit strong capabilities in text-based BI-RADS assessments, their visual diagnostic abilities are currently limited, necessitating further development and cautious application in clinical practice.
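McNemar's test, named in the methods of this abstract, compares two readers (or models) that answered the same questions by looking only at the questions where exactly one of them was correct. A minimal sketch of the exact (binomial) two-sided version follows. The function name `mcnemar_exact` and the example answers are assumptions, and the abstract does not say whether the authors used the exact or the chi-squared form of the test.

```python
from math import comb


def mcnemar_exact(a_correct, b_correct):
    """Two-sided exact McNemar p-value for paired correct/incorrect outcomes.

    `a_correct` and `b_correct` are equal-length sequences of booleans, one
    entry per question, for the two readers or models being compared.
    """
    if len(a_correct) != len(b_correct):
        raise ValueError("both readers must answer the same questions")

    # Discordant pairs: exactly one of the two was correct.
    only_a = sum(1 for a, b in zip(a_correct, b_correct) if a and not b)
    only_b = sum(1 for a, b in zip(a_correct, b_correct) if b and not a)
    n = only_a + only_b
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference

    # Under the null hypothesis, discordant pairs split 50/50 (Binomial(n, 0.5)).
    tail = sum(comb(n, k) for k in range(min(only_a, only_b) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical answers to five questions; reader A is right where B is wrong.
reader_a = [True, True, True, True, True]
reader_b = [False, False, False, False, False]
print(mcnemar_exact(reader_a, reader_b))
```

Questions that both readers answer correctly, or both answer incorrectly, do not affect the result.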
Evaluating Microsoft Bing with ChatGPT-4 for the assessment of abdominal computed tomography and magnetic resonance images

Alperen Elek,Duygu Doğa Ekizalioğlu,Ezgi Güler

DOI: https://doi.org/10.4274/dir.2024.232680

2024-08-19

Abstract:Purpose: To evaluate the performance of Microsoft Bing with ChatGPT-4 technology in analyzing abdominal computed tomography (CT) and magnetic resonance images (MRI). Methods: A comparative and descriptive analysis was conducted using the institutional picture archiving and communication systems. A total of 80 abdominal images (44 CT, 36 MRI) that showed various entities affecting the abdominal structures were included. Microsoft Bing's interpretations were compared with the impressions of radiologists in terms of recognition of the imaging modality, identification of the imaging planes (axial, coronal, and sagittal), sequences (in the case of MRI), contrast media administration, correct identification of the anatomical region depicted in the image, and detection of abnormalities. Results: Microsoft Bing detected that the images were CT scans with 95.4% accuracy (42/44) and that the images were MRI scans with 86.1% accuracy (31/36). However, it failed to detect one CT image (2.3%) and misidentified another CT image as an MRI (2.3%). On the other hand, it also misidentified four MRI as CT images (11.1%) and one as an X-ray (2.7%). Bing achieved an 83.75% success rate in correctly identifying abdominal regions, with 90% accuracy for CT scans (40/44) and 77.7% for MRI scans (28/36). Concerning the identification of imaging planes, Bing achieved a success rate of 95.4% for CT images and 83.3% for MRI. Regarding the identification of MRI sequences (T1-weighted and T2-weighted), the success rate was 68.75%. In the identification of the use of contrast media for CT scans, the success rate was 64.2%. Bing detected abnormalities in 35% of the images but achieved a correct interpretation rate of 10.7% for the definite diagnosis. Conclusion: While Microsoft Bing, leveraging ChatGPT-4 technology, demonstrates proficiency in basic task identification on abdominal CT and MRI, its inability to reliably interpret abnormalities highlights the need for continued refinement to enhance its clinical applicability. Clinical significance: The contribution of large language models (LLMs) to the diagnostic process in radiology is still being explored. However, with a comprehensive understanding of their capabilities and limitations, LLMs can significantly support radiologists during diagnosis and improve the overall efficiency of abdominal radiology practices. Acknowledging the limitations of current studies related to ChatGPT in this field, our work provides a foundation for future clinical research, paving the way for more integrated and effective diagnostic tools.
Fine-Tuning In-House Large Language Models to Infer Differential Diagnosis from Radiology Reports

Luoyao Chen,Revant Teotia,Antonio Verdone,Aidan Cardall,Lakshay Tyagi,Yiqiu Shen,Sumit Chopra

2024-10-12

Abstract:Radiology reports summarize key findings and differential diagnoses derived from medical imaging examinations. The extraction of differential diagnoses is crucial for downstream tasks, including patient management and treatment planning. However, the unstructured nature of these reports, characterized by diverse linguistic styles and inconsistent formatting, presents significant challenges. Although proprietary large language models (LLMs) such as GPT-4 can effectively retrieve clinical information, their use is limited in practice by high costs and concerns over the privacy of protected health information (PHI). This study introduces a pipeline for developing in-house LLMs tailored to identify differential diagnoses from radiology reports. We first utilize GPT-4 to create 31,056 labeled reports, then fine-tune an open-source LLM using this dataset. Evaluated on a set of 1,067 reports annotated by clinicians, the proposed model achieves an average F1 score of 92.1%, which is on par with GPT-4 (90.8%). Through this study, we provide a methodology for constructing in-house LLMs that match the performance of GPT, reduce dependence on expensive proprietary models, and enhance the privacy and security of PHI.

Computation and Language
Applications of Large Language Models (LLMs) in Breast Cancer Care

Vera Sorin,Benjamin S. Glicksberg,Yiftach Barash,Eli Konen,Girish Nadkarni,Eyal Klang

DOI: https://doi.org/10.1101/2023.11.04.23298081

2023-11-05

MedRxiv

Abstract:Purpose: Recently introduced Large Language Models (LLMs) such as ChatGPT have already shown promising results in natural language processing in healthcare. The aim of this study is to systematically review the literature on the applications of LLMs in breast cancer diagnosis and care. Methods: A literature search was conducted using MEDLINE, focusing on studies published up to October 22nd, 2023, using the following terms: large language models, GPT, ChatGPT, OpenAI, and breast. Results: Five studies met our inclusion criteria. All studies were published in 2023, focusing on ChatGPT-3.5 or GPT-4 by OpenAI. Applications included information extraction from clinical notes, question-answering based on guidelines, and patients' management recommendations. The rate of correct answers varied from 64-98%, with the highest accuracy (88-98%) observed in information extraction and question-answering tasks. Notably, most studies utilized real patient data rather than data sourced from the internet. Limitations included inconsistent accuracy, prompt sensitivity, and overlooked clinical details, highlighting areas for cautious LLM integration into clinical practice. Conclusion: LLMs demonstrate promise in text analysis tasks related to breast cancer care, including information extraction and guideline-based question-answering. However, variations in accuracy and the occurrence of erroneous outputs necessitate validation and oversight. Future works should focus on improving reliability of LLMs within clinical workflow.
Large language models for structured reporting in radiology: past, present, and future

Felix Busch,Lena Hoffmann,Daniel Pinto dos Santos,Marcus R. Makowski,Luca Saba,Philipp Prucker,Martin Hadamitzky,Nassir Navab,Jakob Nikolas Kather,Daniel Truhn,Renato Cuocolo,Lisa C. Adams,Keno K. Bressem

DOI: https://doi.org/10.1007/s00330-024-11107-6

IF: 7.034

2024-10-24

European Radiology

Abstract:Structured reporting (SR) has long been a goal in radiology to standardize and improve the quality of radiology reports. Despite evidence that SR reduces errors, enhances comprehensiveness, and increases adherence to guidelines, its widespread adoption has been limited. Recently, large language models (LLMs) have emerged as a promising solution to automate and facilitate SR. Therefore, this narrative review aims to provide an overview of LLMs for SR in radiology and beyond. We found that the current literature on LLMs for SR is limited, comprising ten studies on the generative pre-trained transformer (GPT)-3.5 (n = 5) and/or GPT-4 (n = 8), while two studies additionally examined the performance of Perplexity and Bing Chat or IT5. All studies reported promising results and acknowledged the potential of LLMs for SR, with six out of ten studies demonstrating the feasibility of multilingual applications. Building upon these findings, we discuss limitations, regulatory challenges, and further applications of LLMs in radiology report processing, encompassing four main areas: documentation, translation and summarization, clinical evaluation, and data mining. In conclusion, this review underscores the transformative potential of LLMs to improve efficiency and accuracy in SR and radiology report processing.

radiology, nuclear medicine & medical imaging
Leveraging Professional Radiologists' Expertise to Enhance LLMs' Evaluation for Radiology Reports

Qingqing Zhu,Xiuying Chen,Qiao Jin,Benjamin Hou,Tejas Sudharshan Mathai,Pritam Mukherjee,Xin Gao,Ronald M Summers,Zhiyong Lu

2024-02-17

Abstract:In radiology, Artificial Intelligence (AI) has significantly advanced report generation, but automatic evaluation of these AI-produced reports remains challenging. Current metrics, such as Conventional Natural Language Generation (NLG) and Clinical Efficacy (CE), often fall short in capturing the semantic intricacies of clinical contexts or overemphasize clinical details, undermining report clarity. To overcome these issues, our proposed method synergizes the expertise of professional radiologists with Large Language Models (LLMs), like GPT-3.5 and GPT-4. Utilizing In-Context Instruction Learning (ICIL) and Chain of Thought (CoT) reasoning, our approach aligns LLM evaluations with radiologist standards, enabling detailed comparisons between human and AI generated reports. This is further enhanced by a Regression model that aggregates sentence evaluation scores. Experimental results show that our "Detailed GPT-4 (5-shot)" model achieves a 0.48 score, outperforming the METEOR metric by 0.19, while our "Regressed GPT-4" model shows even greater alignment with expert evaluations, exceeding the best existing metric by a 0.35 margin. Moreover, the robustness of our explanations has been validated through a thorough iterative strategy. We plan to publicly release annotations from radiology experts, setting a new standard for accuracy in future assessments. This underscores the potential of our approach in enhancing the quality assessment of AI-driven medical reports.

Computation and Language,Artificial Intelligence
The Potential of Gemini and GPTs for Structured Report Generation based on Free-Text 18F-FDG PET/CT Breast Cancer Reports

Kun Chen,Wengui Xu,Xiaofeng Li

DOI: https://doi.org/10.1016/j.acra.2024.08.052

2024-09-07

Abstract:Rationale and objective: To compare the performance of large language model (LLM)-based Gemini and Generative Pre-trained Transformers (GPTs) in data mining and generating structured reports based on free-text PET/CT reports for breast cancer after user-defined tasks. Materials and methods: Breast cancer patients (mean age, 50 years ± 11 [SD]; all female) who underwent consecutive 18F-FDG PET/CT for follow-up between July 2005 and October 2023 were retrospectively included in the study. A total of twenty reports from 10 patients were used to train user-defined text prompts for Gemini and GPTs, by which structured PET/CT reports were generated. The natural language processing (NLP) generated structured reports and the structured reports annotated by nuclear medicine physicians were compared in terms of data extraction accuracy and capacity of progress decision-making. Statistical methods, including chi-square test, McNemar test and paired samples t-test, were employed in the study. Results: The structured PET/CT reports for 131 patients were generated by using the two NLP techniques, including Gemini and GPTs. In general, GPTs exhibited superiority over Gemini in data mining in terms of primary lesion size (89.6% vs. 53.8%, p < 0.001) and metastatic lesions (96.3% vs. 89.6%, p < 0.001). Moreover, GPTs outperformed Gemini in progress decision-making (p < 0.001) and semantic similarity (F1 score 0.930 vs. 0.907, p < 0.001) for reports. Conclusion: GPTs outperformed Gemini in generating structured reports based on free-text PET/CT reports, which could potentially be applied in clinical practice. Data availability: The data used and/or analyzed during the current study are available from the corresponding author on reasonable request.
Assessing the performance of large language models (LLMs) in answering medical questions regarding breast cancer in the Chinese context

Ying Piao,Hongtao Chen,Shihai Wu,Xianming Li,Zihuang Li,Dong Yang

DOI: https://doi.org/10.1177/20552076241284771

2024-10-07

Abstract:Purpose: Large language models (LLMs) are deep learning models designed to comprehend and generate meaningful responses, which have gained public attention in recent years. The purpose of this study is to evaluate and compare the performance of LLMs in answering questions regarding breast cancer in the Chinese context. Material and methods: ChatGPT, ERNIE Bot, and ChatGLM were chosen to answer 60 questions related to breast cancer posed by two oncologists. Responses were scored as comprehensive, correct but inadequate, mixed with correct and incorrect data, completely incorrect, or unanswered. The accuracy, length, and readability among answers from different models were evaluated using statistical software. Results: ChatGPT answered 60 questions, with 40 (66.7%) comprehensive answers and six (10.0%) correct but inadequate answers. ERNIE Bot answered 60 questions, with 34 (56.7%) comprehensive answers and seven (11.7%) correct but inadequate answers. ChatGLM generated 60 answers, with 35 (58.3%) comprehensive answers and six (10.0%) correct but inadequate answers. The differences for chosen accuracy metrics among the three LLMs did not reach statistical significance, but only ChatGPT demonstrated a sense of human compassion. The accuracy of the three models in answering questions regarding breast cancer treatment was the lowest, with an average of 44.4%. ERNIE Bot's responses were significantly shorter compared to ChatGPT and ChatGLM (p < .001 for both). The readability scores of the three models showed no statistical significance. Conclusions: In the Chinese context, the capabilities of ChatGPT, ERNIE Bot, and ChatGLM are similar in answering breast cancer-related questions at present. These three LLMs may serve as adjunct informational tools for breast cancer patients in the Chinese context, offering guidance for general inquiries. 
However, for highly specialized issues, particularly in the realm of breast cancer treatment, LLMs cannot deliver reliable performance. It is necessary to utilize them under the supervision of healthcare professionals.
Development and Validation of a Dynamic-Template-Constrained Large Language Model for Generating Fully-Structured Radiology Reports

Chuang Niu,Parisa Kaviani,Qing Lyu,Mannudeep K. Kalra,Christopher T. Whitlow,Ge Wang

2024-10-25

Abstract:Current LLMs for creating fully-structured reports face the challenges of formatting errors, content hallucinations, and privacy leakage issues when uploading data to external servers. We aim to develop an open-source, accurate LLM for creating fully-structured and standardized LCS reports from varying free-text reports across institutions and demonstrate its utility in automatic statistical analysis and individual lung nodule retrieval. With IRB approvals, our retrospective study included 5,442 de-identified LDCT LCS radiology reports from two institutions. We constructed two evaluation datasets by labeling 500 pairs of free-text and fully-structured radiology reports and one large-scale consecutive dataset from January 2021 to December 2023. Two radiologists created a standardized template for recording 27 lung nodule features on LCS. We designed a dynamic-template-constrained decoding method to enhance existing LLMs for creating fully-structured reports from free-text radiology reports. Using consecutive structured reports, we automated descriptive statistical analyses and a nodule retrieval prototype. Our best LLM for creating fully-structured reports achieved high performance on cross-institutional datasets with an F1 score of about 97%, with neither formatting errors nor content hallucinations. Our method consistently improved the best open-source LLMs by up to 10.42%, and outperformed GPT-4o by 17.19%. The automatically derived statistical distributions were consistent with prior findings regarding attenuation, location, size, stability, and Lung-RADS. The retrieval system with structured reports allowed flexible nodule-level search and complex statistical analysis. Our developed software is publicly available for local deployment and further research.

Artificial Intelligence,Computation and Language
