Evaluating the Limitations of Large Language Models in Therapeutic Decision-Making for Patients with Aortic Stenosis

Tobias Roeschl, Marie Hoffmann, Djawid Hashemi, Felix Rarreck, Nils Hinrichs, Tobias Daniel Trippel, Axel Unbehaun, Christoph Klein, Jörg Kempfert, Henryk Dreger, Benjamin O'Brien, Gerhard Hindricks, Felix Balzer, Volkmar Falk, Alexander Mayer
DOI: https://doi.org/10.1101/2024.11.20.24313385
2024-11-23
Abstract: Aims: Large language models (LLMs) have shown promise in therapeutic decision-making, with performance comparable to that of medical experts, but the studies reporting this used specially prepared patient data. The aim of this study was to determine whether LLMs can make guideline-adherent treatment decisions based on real-world patient data. Methods and Results: We conducted a retrospective study of 80 patients with severe aortic stenosis who were scheduled for either surgical aortic valve replacement (SAVR, n=24) or transcatheter aortic valve replacement (TAVR, n=56) by our institutional heart team in 2022. Various LLMs (BioGPT, GPT-3.5, GPT-4, GPT-4 Turbo, Llama-2, Mistral, and PaLM-2) were queried using either deidentified original medical reports or manually generated case summaries to determine the most guideline-adherent treatment. Agreement with the heart team was measured using Cohen's kappa coefficients, reliability using intraclass correlation coefficients (ICCs), and fairness using frequency bias indices (FBIs), with FBIs >1 indicating bias towards TAVR. When presented with original medical reports, LLMs showed poor performance (kappa: -0.47 to 0.09, ICC: 0.0 to 0.91, FBI: 0.95 to 1.53). The LLMs' performance improved substantially when case summaries were used as input and additional guideline knowledge was added to the prompt (kappa: -0.02 to 0.62, ICC: 0.01 to 0.97, FBI: 0.46 to 1.24). Qualitative analysis revealed instances of hallucinations in all LLMs tested. Conclusion: Our findings suggest that even advanced LLMs currently make informed treatment decisions only with extensively pre-processed data, not with original patient data. Unreliable responses, bias, and hallucinations pose significant health risks and highlight the need for caution in applying LLMs to real-world clinical decision-making.
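To make two of the reported metrics concrete, the following is a minimal Python sketch of how Cohen's kappa and a frequency bias index could be computed for a binary TAVR/SAVR decision. It is not the authors' analysis code. The abstract does not define the FBI, so the sketch assumes the usual definition (number of TAVR recommendations by the model divided by the number of TAVR decisions by the heart team). The function names and the ten toy labels are hypothetical and are not study data. The ICC is left out because it depends on the specific ICC model chosen, which the abstract does not state.

```python
from collections import Counter


def cohens_kappa(reference, predicted):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(reference)
    # Observed agreement: fraction of cases where both raters give the same label.
    p_o = sum(r == p for r, p in zip(reference, predicted)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    ref_counts, pred_counts = Counter(reference), Counter(predicted)
    labels = set(ref_counts) | set(pred_counts)
    p_e = sum(ref_counts[lab] * pred_counts[lab] for lab in labels) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when both raters use a single label")
    return (p_o - p_e) / (1 - p_e)


def frequency_bias_index(reference, predicted, positive="TAVR"):
    """Assumed definition: predicted positives divided by reference positives.

    Values above 1 mean the model recommends `positive` more often than
    the reference rater does.
    """
    n_ref = sum(r == positive for r in reference)
    n_pred = sum(p == positive for p in predicted)
    if n_ref == 0:
        raise ValueError("reference contains no positive cases")
    return n_pred / n_ref


# Hypothetical toy labels (NOT study data): 10 cases, heart team vs. one model.
heart_team = ["TAVR"] * 6 + ["SAVR"] * 4
model = ["TAVR"] * 6 + ["TAVR", "TAVR", "SAVR", "SAVR"]

kappa = cohens_kappa(heart_team, model)
fbi = frequency_bias_index(heart_team, model)
print(f"kappa = {kappa:.3f}, FBI = {fbi:.3f}")  # kappa = 0.545, FBI = 1.333
```

In this toy case the model agrees with the heart team in 8 of 10 cases, yet kappa is only about 0.55 because much of that agreement is expected by chance, and the FBI above 1 reflects the two extra TAVR recommendations.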