Accuracy, Consistency, and Hallucination of Large Language Models When Analyzing Unstructured Clinical Notes in Electronic Medical Records.

Savyasachi V. Shah
DOI: https://doi.org/10.1001/jamanetworkopen.2024.25953
2024-08-01
JAMA Network Open
Abstract: Researchers have been exploring advanced artificial intelligence (AI) techniques such as large language models (LLMs) to enhance data extraction from clinical notes in electronic medical records (EMRs), a critical need given the vast amounts of unstructured data in clinical notes. Evolving LLMs, their handling of details in prompting, and hallucinations in replication warrant a comparison with conventional approaches to understand the tradeoff between efficiency and accuracy. One such study by Burford et al 1 investigates the utility and efficacy of an LLM, ChatGPT-4 (OpenAI), in extracting helmet status from clinical narratives of patients involved in micromobility-related accidents. 1 This study by Burford et al 1 leverages data from the US Consumer Product Safety Commission (CPSC) National Electronic Injury Surveillance System (NEISS) spanning 2019 to 2022, including 54729 emergency department (ED) visits among patients with a micromobility accident. The primary objective was to compare the LLM’s performance with the text string–search approach in identifying whether patients were wearing helmets at the time of their accidents. Three different levels of prompting detail for the LLM (low, intermediate, and high) were employed, and the agreement with the text string–search approach was measured using Cohen κ test statistics. The test-retest reliability of the high-detail prompt was measured across new chat sessions on 5 different days using Fleiss κ statistics. Performance statistics were calculated for a criterion standard review in a small random sample of 400 records comparing results from the high-detail prompt and text string–search approach to classifications of helmet status generated by researchers reading the clinical notes. Burford and colleagues 1 found moderate agreement (Cohen κ = 0.74 [95% CI, 0.73-0.75]) for the low-detail prompt and weak agreement (Cohen κ = 0.53 [95% CI, 0.52-0.54]) for the intermediate-detail prompt compared with the text string–search approach.
The high-detail prompt, which included comprehensive researcher-generated
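The comparison described in the abstract, a text string search scored against a second labeling with Cohen κ, can be illustrated with a minimal Python sketch. It is not the method of Burford et al: the search phrases, the label names, and the function names `string_search_helmet_status` and `cohen_kappa` are hypothetical and chosen only for illustration. The κ formula itself is the standard one, (observed agreement − chance agreement) / (1 − chance agreement).

```python
from collections import Counter


def string_search_helmet_status(note):
    """Classify helmet status from a free-text note by substring search.

    The phrases below are illustrative assumptions, not the search
    strings used in the study discussed in the abstract.
    """
    text = note.lower()
    # Check negated phrases first, because "unhelmeted" contains "helmet".
    if "no helmet" in text or "unhelmeted" in text or "without helmet" in text:
        return "no helmet"
    if "helmet" in text:
        return "helmet"
    return "unknown"


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: share of items on which both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over categories of the product of each
    # rater's marginal proportion for that category.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1.0 - expected)
```

In use, one list would hold the string-search labels for each note and the other the labels from a second source (for example, an LLM's output for the same notes); `cohen_kappa` then returns a single agreement score for the pair. Fleiss κ, which the abstract mentions for test-retest reliability across 5 sessions, generalizes this idea to more than two raters and is not implemented here.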
Medicine, Computer Science