Humans Continue to Outperform Large Language Models in Complex Clinical Decision-Making: A Study with Medical Calculators

Nicholas Wan, Qiao Jin, Joey Chan, Guangzhi Xiong, Serina Applebaum, Aidan Gilson, Reid McMurry, R. Andrew Taylor, Aidong Zhang, Qingyu Chen, Zhiyong Lu
2024-11-08
Abstract: Although large language models (LLMs) have been assessed for general medical knowledge using medical licensing exams, their ability to effectively support clinical decision-making tasks, such as selecting and using medical calculators, remains uncertain. Here, we evaluate the capability of both medical trainees and LLMs to recommend medical calculators in response to various multiple-choice clinical scenarios such as risk stratification, prognosis, and disease diagnosis. We assessed eight LLMs, including open-source, proprietary, and domain-specific models, with 1,009 question-answer pairs across 35 clinical calculators and measured human performance on a subset of 100 questions. While the highest-performing LLM, GPT-4o, provided an answer accuracy of 74.3% (CI: 71.5-76.9%), human annotators, on average, outperformed LLMs with an accuracy of 79.5% (CI: 73.5-85.0%). With error analysis showing that the highest-performing LLMs continue to make mistakes in comprehension (56.6%) and calculator knowledge (8.1%), our findings emphasize that humans continue to surpass LLMs on complex clinical tasks such as calculator recommendation.
Computation and Language, Artificial Intelligence, Human-Computer Interaction
What problem does this paper attempt to address?
This paper evaluates how well large language models (LLMs) perform on a complex clinical decision-making task: recommending the appropriate medical calculator for a given clinical scenario. Specifically, the researchers focused on the following points:

1. **Evaluating the capabilities of LLMs**: The researchers evaluated multiple LLMs (open-source, proprietary, and domain-specific models) on recommending clinical calculators. These calculators are used in clinical scenarios such as risk stratification, prognosis, and disease diagnosis.
2. **Comparison with human performance**: By comparing medical trainees with LLMs, the researchers sought to determine whether LLMs reach or surpass human-level performance on complex clinical tasks.
3. **Error analysis**: By classifying the errors LLMs made, the researchers aimed to identify where LLMs fail when selecting clinical calculators, such as comprehension errors and calculator knowledge errors.

### Main findings

- **Performance of LLMs**: The highest-performing LLM, GPT-4o, reached an answer accuracy of 74.3% (confidence interval: 71.5%-76.9%) on 1,009 questions, while human annotators averaged 79.5% (confidence interval: 73.5%-85.0%) on a subset of 100 questions. Even the best LLM therefore remained below human performance on this task.
- **Error types**: Among the highest-performing LLMs, the reported error types include comprehension errors (56.6%) and calculator knowledge errors (8.1%). These errors point to the limitations of LLMs in handling complex clinical situations.

### Research significance

This study emphasizes that, although LLMs have certain capabilities, they have not yet reached human-level performance on complex clinical decision-support tasks. Future research could improve the usefulness of LLMs in the medical field by expanding datasets, improving model training, and incorporating time-series data analysis.

### Method overview

1. **Data sources**: The researchers selected 35 commonly used clinical calculators from MDCalc and constructed question-answer pairs using patient cases in the PMC-Patients dataset.
2. **Question generation**: Multiple-choice questions were generated by truncating patient medical records and removing known calculator usage information.
3. **Evaluation process**: Eight LLMs were evaluated on the 1,009 question-answer pairs and two medical trainees on a subset of 100 questions, followed by a detailed error analysis.

Through this method, the researchers systematically evaluated LLM performance on the clinical calculator recommendation task and provided directions for future improvements.
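To make the reported accuracy figures more concrete, the following is a minimal sketch of how a confidence interval for a multiple-choice accuracy can be computed. The source does not state which interval method the authors used; the Wilson score interval here is an assumption chosen for illustration, and the count of 750 correct answers is a hypothetical value back-computed from 74.3% of 1,009 questions rather than a number taken from the paper.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Return the Wilson score interval for a proportion correct/total.

    z = 1.96 corresponds to an approximate 95% confidence level.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")

    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


# Hypothetical count: 750 is assumed from 74.3% of 1,009, not reported in the source.
correct, total = 750, 1009
low, high = wilson_interval(correct, total)
print(f"accuracy = {correct / total:.1%}, 95% CI = ({low:.1%}, {high:.1%})")
```

Under these assumptions the interval comes out close to the one reported for GPT-4o, but that agreement should not be read as confirmation of the authors' actual procedure, which may have used bootstrapping or another method.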