Transformative potential of Large Language Models in data mining on Electronic Health Records.

Amadeo Jesus Wals Zurita Sr., Hector Miras del Rio Sr., Nerea Ugarte Ruiz de Aguirre, Cristina Nebrera Navarro, Maria Rubio Jimenez, David Munoz Carmona, Carlos Miguez Sanchez
DOI: https://doi.org/10.1101/2024.03.07.24303588
2024-10-14
Abstract:
Introduction: In this study, we evaluate the accuracy, efficiency, and cost-effectiveness of Large Language Models (LLMs) in extracting and structuring information from free-text clinical reports, particularly in identifying and classifying patient comorbidities within oncology electronic health records. We specifically compare the performance of the gpt-3.5-turbo-1106 and gpt-4-1106-preview models against that of specialized human evaluators.
Methods: We implemented a script using the OpenAI API to extract structured information in JSON format from the comorbidities reported in 250 personal history reports. These reports were also manually reviewed in batches of 50 by five specialists in radiation oncology. We compared the results using Sensitivity, Specificity, Precision, Accuracy, F-value, the Kappa index, and McNemar's test, and examined the common causes of errors in both humans and GPT models.
Results: The GPT-3.5 model performed slightly worse than physicians across all metrics, though the differences were not statistically significant (McNemar's test, p = 0.79). GPT-4 demonstrated clear superiority in several key metrics (McNemar's test, p < 0.001). Notably, it achieved a sensitivity of 96.8%, compared to 88.2% for GPT-3.5 and 88.8% for physicians. However, physicians marginally outperformed GPT-4 in precision (97.7% vs. 96.8%). GPT-4 was also more consistent, replicating exactly the same results in 76% of the reports across 10 repeated analyses, compared to 59% for GPT-3.5, indicating more stable and reliable performance. Physicians were more likely to miss explicit comorbidities, while the GPT models more frequently inferred non-explicit comorbidities, sometimes correctly, though this also resulted in more false positives.
Conclusion: This study demonstrates that, with well-designed prompts, the LLMs examined can match or even surpass medical specialists in extracting information from complex clinical reports. Their superior efficiency in time and cost, along with easy integration with databases, makes them a valuable tool for large-scale data mining and real-world evidence generation.
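For readers unfamiliar with the evaluation metrics named in the Methods, the following is a minimal Python sketch of how they can be computed from confusion-matrix counts. It is not the authors' script. It assumes that "F-value" means the F1 score, that the Kappa index is Cohen's kappa, and that McNemar's test is applied in its exact binomial form; the abstract does not specify any of these. The function names and the counts in the usage lines are hypothetical and are not taken from the study.

```python
from math import comb


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute agreement metrics against a reference standard.

    tp/fp/fn/tn are counts of true positives, false positives,
    false negatives and true negatives (hypothetical inputs).
    """
    total = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)   # share of real comorbidities that were found
    specificity = tn / (tn + fp)   # share of absent comorbidities correctly left out
    precision = tp / (tp + fp)     # share of reported comorbidities that were real
    accuracy = (tp + tn) / total
    # Assumption: "F-value" is the F1 score (harmonic mean of precision and sensitivity).
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    # Assumption: "Kappa index" is Cohen's kappa (agreement corrected for chance).
    p_expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total**2
    kappa = (accuracy - p_expected) / (1 - p_expected)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
        "kappa": kappa,
    }


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b = items rater A got right and rater B got wrong,
    c = items rater B got right and rater A got wrong.
    Under the null hypothesis the discordant pairs split 50/50.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# Illustrative counts only, not data from the study.
metrics = classification_metrics(tp=40, fp=10, fn=10, tn=40)
p_value = mcnemar_exact(b=1, c=9)
```

McNemar's test uses only the discordant pairs (items on which the two raters disagree about correctness), which is why it suits a paired comparison in which a model and physicians classify the same reports.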
Health Informatics