Accuracy of a Proprietary Large Language Model in Labeling Obstetric Incident Reports

Jeanene Johnson, Conner Brown, Grace Lee, Keith Morse
DOI: https://doi.org/10.1016/j.jcjq.2024.08.001
2024-08-06
Abstract:

Background: Using the data collected through incident reporting systems is challenging, as it is a large volume of primarily qualitative information. Large language models (LLMs), such as ChatGPT, provide novel capabilities in text summarization and labeling that could support safety data trending and early identification of opportunities to prevent patient harm. This study assessed the capability of a proprietary LLM (GPT-3.5) to automatically label a cross-sectional sample of real-world obstetric incident reports.

Methods: A sample of 370 incident reports submitted to inpatient obstetric units between December 2022 and May 2023 was extracted. Human-annotated labels were assigned by a clinician reviewer and considered gold standard. The LLM was prompted to label incident reports relying solely on its pretrained knowledge and information included in the prompt. Primary outcomes assessed were sensitivity, specificity, positive predictive value, and negative predictive value. A secondary outcome assessed the human-perceived quality of the model's justification for the label(s) applied.

Results: The LLM demonstrated the ability to label incident reports with high sensitivity and specificity. The model applied a total of 79 labels compared to the reviewer's 49 labels. Overall sensitivity for the model was 85.7%, and specificity was 97.9%. Positive and negative predictive values were 53.2% and 99.6%, respectively. For 60.8% of labels, the reviewer approved of the model's justification for applying the label.

Conclusion: The proprietary LLM demonstrated the ability to label obstetric incident reports with high sensitivity and specificity. LLMs offer the potential to enable more efficient use of data from incident reporting systems.
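The four primary outcomes follow from a standard confusion matrix, and the reported figures can be approximately reproduced with a short sketch. The abstract gives only the label totals (79 applied by the model, 49 by the reviewer). The true-positive count of 42 is inferred here because 42/49 and 42/79 round to the reported 85.7% and 53.2%. The true-negative count of 1725 is a hypothetical value, chosen only so that specificity lands near the reported 97.9%; it is not stated in the abstract. The names `ConfusionCounts` and `label_metrics` are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # model label matches a reviewer label
    fp: int  # model label with no matching reviewer label
    fn: int  # reviewer label the model missed
    tn: int  # label correctly not applied by the model


def label_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Return the four primary outcomes as fractions in [0, 1]."""
    return {
        "sensitivity": c.tp / (c.tp + c.fn),
        "specificity": c.tn / (c.tn + c.fp),
        "ppv": c.tp / (c.tp + c.fp),
        "npv": c.tn / (c.tn + c.fn),
    }


model_labels = 79      # reported in the abstract
reviewer_labels = 49   # reported in the abstract
tp = 42                # ASSUMPTION: inferred from the reported percentages
tn = 1725              # ASSUMPTION: hypothetical, not stated in the abstract

counts = ConfusionCounts(
    tp=tp,
    fp=model_labels - tp,
    fn=reviewer_labels - tp,
    tn=tn,
)
metrics = label_metrics(counts)

for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Under these assumed counts, 37 of the model's 79 labels would have no matching reviewer label, which is what would pull the positive predictive value down to about 53% even though sensitivity and specificity remain high.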