Evaluating and Mitigating Limitations of Large Language Models in Clinical Decision Making

Paul Hager, Friederike Jungmann, Kunal Bhagat, Inga Hubrecht, Manuel Knauer, Jakob Vielhauer, Robbie Holland, Rickmer Braren, Marcus Makowski, Georgios Kaisis, Daniel Rueckert
DOI: https://doi.org/10.1101/2024.01.26.24301810
2024-01-26
Abstract: Clinical decision making is one of the most impactful parts of a physician’s responsibilities and stands to benefit greatly from AI solutions and large language models (LLMs) in particular. However, while LLMs have achieved excellent performance on medical licensing exams, these tests fail to assess many skills that are necessary for deployment in a realistic clinical decision making environment, including gathering information, adhering to established guidelines, and integrating into clinical workflows. To understand how useful LLMs are in real-world settings, we must evaluate them in the wild, i.e. on real-world data under realistic conditions. Here we have created a curated dataset based on the MIMIC-IV database spanning 2400 real patient cases and four common abdominal pathologies as well as a framework to simulate a realistic clinical setting. We show that current state-of-the-art LLMs do not accurately diagnose patients across all pathologies (performing significantly worse than physicians on average), follow neither diagnostic nor treatment guidelines, and cannot interpret laboratory results, thus posing a serious risk to the health of patients. Furthermore, we move beyond diagnostic accuracy and demonstrate that they cannot be easily integrated into existing workflows because they often fail to follow instructions and are sensitive to both the quantity and order of information. Overall, our analysis reveals that LLMs are currently not ready for clinical deployment while providing a dataset and framework to guide future studies.
Health Informatics
What problem does this paper attempt to address?
The main problem this paper addresses is how to evaluate and mitigate the limitations of large language models (LLMs) in clinical decision-making. Specifically, the researchers focus on the following aspects:

1. **Diagnostic accuracy**:
   - The researchers created a curated dataset (MIMIC-CDM) based on the MIMIC-IV database. It contains 2,400 real patient cases covering four common abdominal pathologies (appendicitis, cholecystitis, diverticulitis, and pancreatitis) and is used to evaluate the diagnostic accuracy of LLMs under realistic conditions.
   - The results show that current state-of-the-art LLMs diagnose significantly less accurately than physicians across all pathologies, especially when the model has to gather the information itself.
2. **Adherence to diagnostic and treatment guidelines**:
   - LLMs fail to follow established diagnostic and treatment guidelines. For example, they often do not request the physical examinations, laboratory tests, or imaging that the guidelines require.
   - LLMs interpret laboratory results very poorly, especially in the key categories of low and high test results, which may lead to incorrect diagnoses and treatment recommendations.
3. **Instruction-following ability**:
   - LLMs often fail to follow instructions, which makes them difficult to integrate into existing clinical workflows.
4. **Robustness**:
   - LLMs are sensitive to changes in the instructions and to the amount and order of the information they receive, which further limits their practicality in the clinical environment.
5. **Safety**:
   - Because of the problems above, LLMs cannot currently be used safely for clinical decision-making: they may make diagnoses based on incomplete information, posing a risk to patient health.
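The laboratory-interpretation finding can be made concrete with a small scoring sketch. It assumes that each laboratory value comes with a reference range, that the ground-truth category (low, normal, high) follows directly from that range, and that accuracy is reported separately per category. The function names, the sample values, and the predictions are hypothetical illustrations, not the authors' code or data.

```python
from collections import defaultdict


def reference_label(value, low, high):
    """Ground-truth category of a lab value relative to its reference range."""
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "normal"


def per_category_accuracy(records, predictions):
    """Fraction of correct predictions within each ground-truth category."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for (value, low, high), predicted in zip(records, predictions):
        truth = reference_label(value, low, high)
        total[truth] += 1
        if predicted == truth:
            correct[truth] += 1
    return {label: correct[label] / total[label] for label in total}


# Hypothetical (value, lower bound, upper bound) triples; not taken from MIMIC-IV.
records = [(3.1, 3.5, 5.0), (4.2, 3.5, 5.0), (5.8, 3.5, 5.0), (2.9, 3.5, 5.0)]
# Hypothetical model answers for the same four results.
predictions = ["normal", "normal", "high", "low"]

scores = per_category_accuracy(records, predictions)
print(scores)
```

Reporting accuracy per category rather than as a single number matters here because most results in a typical panel are normal, so a model that always answers "normal" could still look accurate overall while missing the abnormal values that drive a diagnosis.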
Overall, this paper reveals multiple limitations of LLMs in clinical decision-making by constructing a dataset and evaluation framework that simulates a real clinical environment, and emphasizes the importance of solving these problems before practical application.