Evaluating large language model workflows in clinical decision support: referral, triage, and diagnosis

Farieda Gaber, Maqsood Shaik, Vedran Franke, Altuna Akalin
DOI: https://doi.org/10.1101/2024.09.27.24314505
2024-09-28
Abstract: Accurate medical decision-making is critical for both patients and clinicians. Patients often struggle to interpret their symptoms, determine their severity, and select the right specialist. Simultaneously, clinicians face challenges in integrating complex patient data to make timely, accurate diagnoses. Recent advances in large language models (LLMs) offer the potential to bridge this gap by supporting decision-making for both patients and healthcare providers. In this study, we benchmark multiple LLM versions and an LLM-based workflow incorporating retrieval-augmented generation (RAG) on a curated dataset of 2,000 medical cases derived from the Medical Information Mart for Intensive Care database. Our findings show that these LLMs are capable of providing personalized insights into likely diagnoses, suggesting appropriate specialists, and assessing urgent care needs. These models may also support clinicians in refining diagnoses and decision-making, offering a promising approach to improving patient outcomes and streamlining healthcare delivery.
Health Informatics
What problem does this paper attempt to address?
This paper evaluates how effectively large language models (LLMs) support clinical decision-making in three areas: referral, triage, and diagnosis. Specifically, the researchers ask:

1. **Referral**: Can LLMs recommend an appropriate specialist based on a patient's symptoms and basic information?
2. **Triage**: Can LLMs accurately assess the urgency of a patient's condition, expressed as an Emergency Severity Index (ESI) level, from symptoms and initial vital signs?
3. **Diagnosis**: Can LLMs predict likely diagnoses from symptoms and initial vital signs?

### Background

Clinical decision-making is a complex process. Clinicians must weigh symptoms, vital signs, medical history, and test results together to reach a timely and accurate diagnosis. When medical resources are constrained, clinicians work under considerable pressure. In high-pressure settings such as the emergency department (ED), fast and accurate triage, diagnosis, and treatment matter most. Incorrect triage can delay treatment or waste resources, which in turn affects patient prognosis.

### Research Objectives

To address these challenges, the researchers evaluated multiple LLM versions and a retrieval-augmented generation (RAG) workflow on 2,000 real-world cases from the Medical Information Mart for Intensive Care database. Specific objectives include:

1. **Evaluating the performance of different LLMs**: Comparing the models on the referral, triage, and diagnosis tasks.
2. **Exploring the advantages of RAG-assisted LLMs**: Assessing whether RAG assistance reduces "hallucinations" (i.e., generated information that is inaccurate or irrelevant).
3. **Analyzing performance in different user scenarios**: Comparing a general-user scenario (symptom information only) with a clinician scenario (symptoms plus initial vital signs).

### Methods

The researchers built a custom dataset of 2,000 real-world cases from the MIMIC-IV ED dataset. The cases cover a wide range of medical conditions and record each patient's symptoms, vital signs, and final diagnoses in detail. The researchers tested several LLMs, including Claude 3.5 Sonnet, Claude 3 Sonnet, and Claude 3 Haiku, and developed a RAG-assisted LLM workflow.

### Main Findings

1. **Triage task**: The RAG-assisted LLM achieved the highest exact-match accuracy in both user scenarios. Adding initial vital signs significantly improved the models' triage ability.
2. **Referral task**: Claude 3.5 Sonnet performed best at predicting the appropriate specialist, although the differences between models were small.
3. **Diagnosis task**: Claude 3.5 Sonnet and Claude 3 Sonnet performed well at predicting at least one correct diagnosis. The diagnostic ability of the RAG-assisted LLM improved significantly when initial vital signs were provided.

### Conclusions

The results indicate that LLMs have considerable potential in clinical decision support, particularly for triage and diagnosis. By drawing on reliable external references, the RAG-assisted LLM reduces hallucinations and improves the reliability and accuracy of the model's output. These findings provide a reference point for future applications of AI in clinical decision-making.
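
### Illustrative metric sketch

The summary names two evaluation criteria: exact-match accuracy for triage and "at least one correct diagnosis" for the diagnosis task. It does not say how these were computed. The following Python sketch shows one plausible way to compute them. The function names, the toy data, and the case-insensitive string comparison for diagnoses are all assumptions made for illustration. They are not the authors' implementation, and the paper may match diagnoses differently (for example, through clinical review or a model-based comparison).

```python
from typing import Iterable, Sequence


def exact_match_accuracy(predicted: Sequence[int], reference: Sequence[int]) -> float:
    """Fraction of cases where the predicted ESI level equals the reference level."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        return 0.0
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


def _normalize(labels: Iterable[str]) -> set:
    # Assumption: diagnoses are compared as lowercased, stripped strings.
    return {label.strip().lower() for label in labels}


def at_least_one_hit_rate(
    predicted: Sequence[Iterable[str]], reference: Sequence[Iterable[str]]
) -> float:
    """Fraction of cases where any predicted diagnosis appears in the reference list."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        return 0.0
    hits = sum(
        1 for p, r in zip(predicted, reference) if _normalize(p) & _normalize(r)
    )
    return hits / len(reference)


if __name__ == "__main__":
    # Toy values invented for illustration; not taken from the paper.
    esi_predicted = [2, 3, 3, 1]
    esi_reference = [2, 3, 4, 1]
    print(exact_match_accuracy(esi_predicted, esi_reference))

    dx_predicted = [["Pneumonia", "sepsis"], ["migraine"]]
    dx_reference = [["pneumonia"], ["stroke"]]
    print(at_least_one_hit_rate(dx_predicted, dx_reference))
```

Exact match is a strict criterion for ESI, because a prediction one level off counts the same as one that is four levels off. A real evaluation might also report how far predictions fall from the reference level.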