Large Language Model Capabilities in Perioperative Risk Prediction and Prognostication

Philip Chung, Christine T Fong, Andrew M Walters, Nima Aghaeepour, Meliha Yetisgen, Vikas N O'Reilly-Shah
2024-01-03
Abstract: We investigate whether general-domain large language models such as GPT-4 Turbo can perform risk stratification and predict post-operative outcome measures using a description of the procedure and a patient's clinical notes derived from the electronic health record. We examine predictive performance on 8 different tasks: prediction of ASA Physical Status Classification, hospital admission, ICU admission, unplanned admission, hospital mortality, PACU Phase 1 duration, hospital duration, and ICU duration. Few-shot and chain-of-thought prompting improve predictive performance for several of the tasks. We achieve F1 scores of 0.50 for ASA Physical Status Classification, 0.81 for ICU admission, and 0.86 for hospital mortality. Performance on duration prediction tasks was universally poor across all prompt strategies. Current generation large language models can assist clinicians in perioperative risk stratification on classification tasks and produce high-quality natural language summaries and explanations.
Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper explores whether general-domain large language models (such as GPT-4 Turbo) can perform preoperative risk stratification and postoperative outcome prediction from a description of the procedure and clinical notes extracted from the electronic health record (EHR). Specifically, the study evaluates the model on eight tasks:
1. Assignment of the American Society of Anesthesiologists Physical Status Classification (ASA-PS).
2. Prediction of post-anesthesia care unit (PACU) Phase 1 duration.
3. Prediction of hospital admission.
4. Prediction of hospital duration.
5. Prediction of intensive care unit (ICU) admission.
6. Prediction of ICU duration.
7. Prediction of unplanned admission.
8. Prediction of hospital mortality.
The study compares prompting strategies and finds that few-shot and chain-of-thought (CoT) prompting improve performance on several of these tasks. The model performs well on some classification tasks (F1 of 0.81 for ICU admission and 0.86 for hospital mortality), is weaker on ASA-PS assignment (F1 of 0.50), and performs poorly on every duration prediction task (PACU Phase 1, hospital, and ICU duration) regardless of prompt strategy. The study also examines how clinical note length affects performance and what happens when summaries are used in place of the original notes. Overall, it concludes that current large language models can assist clinicians with perioperative risk stratification on classification tasks.
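To make the two central ideas concrete, the following minimal Python sketch shows one way a few-shot chain-of-thought prompt could be assembled and how an F1 score is computed for a binary task such as ICU admission. It is an illustration only: the paper's actual prompt wording, example selection, and F1 averaging scheme are not given in this summary, so the `Example` structure, the field labels, and the functions `build_prompt` and `binary_f1` are all hypothetical names, and the case text is synthetic placeholder text rather than study data.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One labelled demonstration case (hypothetical structure, not from the paper)."""
    procedure: str
    notes: str
    rationale: str  # worked reasoning shown to the model (the chain of thought)
    answer: str


def build_prompt(task: str, examples: list[Example], procedure: str, notes: str) -> str:
    """Assemble a few-shot chain-of-thought prompt as plain text.

    Each demonstration shows reasoning before its answer, and the prompt ends
    with an open "Reasoning:" cue so the model explains before it answers.
    """
    parts = [f"Task: {task}", ""]
    for i, ex in enumerate(examples, start=1):
        parts += [
            f"Example {i}",
            f"Procedure: {ex.procedure}",
            f"Notes: {ex.notes}",
            f"Reasoning: {ex.rationale}",
            f"Answer: {ex.answer}",
            "",
        ]
    parts += ["New case", f"Procedure: {procedure}", f"Notes: {notes}", "Reasoning:"]
    return "\n".join(parts)


def binary_f1(y_true: list[bool], y_pred: list[bool]) -> float:
    """F1 = harmonic mean of precision and recall for the positive class."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Synthetic placeholder case, for illustration only.
demo = Example(
    procedure="<procedure description>",
    notes="<clinical note text>",
    rationale="<step-by-step reasoning>",
    answer="Yes",
)
prompt = build_prompt(
    "Predict whether the patient will be admitted to the ICU. Answer Yes or No.",
    [demo],
    procedure="<new procedure description>",
    notes="<new clinical note text>",
)
score = binary_f1([True, True, False, False], [True, False, True, False])
```

Dropping the `Reasoning:` lines gives a plain few-shot prompt, and passing an empty example list gives a zero-shot one, which is a simple way to compare prompt strategies on the same cases.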