Abstract: Accurate medical decision-making is critical for both patients and clinicians. Patients often struggle to interpret their symptoms, determine their severity, and select the right specialist. Simultaneously, clinicians face challenges in integrating complex patient data to make timely, accurate diagnoses. Recent advances in large language models (LLMs) offer the potential to bridge this gap by supporting decision-making for both patients and healthcare providers. In this study, we benchmark multiple LLM versions and an LLM-based workflow incorporating retrieval-augmented generation (RAG) on a curated dataset of 2,000 medical cases derived from the Medical Information Mart for Intensive Care database. Our findings show that these LLMs are capable of providing personalized insights into likely diagnoses, suggesting appropriate specialists, and assessing urgent care needs. These models may also support clinicians in refining diagnoses and decision-making, offering a promising approach to improving patient outcomes and streamlining healthcare delivery.

What problem does this paper attempt to address?

This paper evaluates how effectively large language models (LLMs) support clinical decision-making in three areas: referral, triage, and diagnosis. Specifically, the researchers ask:

1. **Referral**: Can LLMs recommend appropriate specialists based on patients' symptoms and basic information?
2. **Triage**: Can LLMs accurately assess the urgency of patients' conditions, measured with the Emergency Severity Index (ESI), from symptoms and initial vital signs?
3. **Diagnosis**: Can LLMs predict likely diagnoses from symptoms and initial vital signs?

### Background

Clinical decision-making is a complex process. Doctors must weigh symptoms, vital signs, medical history, and test results together to reach a timely and accurate diagnosis. When medical resources are scarce, doctors work under heavy pressure. In high-pressure settings such as the emergency department (ED), rapid and accurate triage, diagnosis, and treatment matter most. Incorrect triage can delay treatment or waste resources, which in turn affects patient prognosis.

### Research Objectives

To address these challenges, the researchers evaluated multiple LLM versions and a retrieval-augmented generation (RAG) workflow on 2,000 real-world cases from the Medical Information Mart for Intensive Care database. The specific objectives were:

1. **Evaluating the performance of different LLMs**: Comparing the models on the referral, triage, and diagnosis tasks.
2. **Exploring the advantages of RAG-assisted LLMs**: Assessing whether RAG reduces "hallucinations" (that is, inaccurate or irrelevant generated information).
3. **Analyzing performance in different user scenarios**: Comparing use by general users, who provide only symptom information, with use by clinicians, who provide symptoms and initial vital signs.

### Methods

The researchers used the MIMIC-IV ED dataset to build a curated dataset of 2,000 real-world cases. These cases cover a wide range of medical conditions and record each patient's symptoms, vital signs, and final diagnoses in detail. The researchers tested multiple LLMs, including Claude 3.5 Sonnet, Claude 3 Sonnet, and Claude 3 Haiku, and developed a RAG-assisted LLM workflow.

### Main Findings

1. **Triage task**: The RAG-assisted LLM achieved the highest exact-match accuracy in both user scenarios. Adding initial vital signs significantly improved the models' triage ability.
2. **Referral task**: Claude 3.5 Sonnet performed best at predicting the appropriate specialist, but the differences between models were small.
3. **Diagnosis task**: Claude 3.5 Sonnet and Claude 3 Sonnet performed well at predicting at least one correct diagnosis. The diagnostic ability of the RAG-assisted LLM improved significantly when initial vital signs were provided.

### Conclusions

The results suggest that LLMs have considerable potential in clinical decision support, especially for triage and diagnosis. By drawing on reliable external references, the RAG-assisted LLM reduces hallucinations and improves the reliability and accuracy of the model. These findings provide a reference point for future applications of AI in clinical decision-making.
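
The two metrics named in the findings, exact-match accuracy for triage and "at least one correct diagnosis", can be illustrated with a minimal sketch. The function names, the toy data, and the string-normalization rule below are assumptions made for illustration. The summary does not describe how the study matched predicted diagnoses to reference diagnoses, so plain case-insensitive string comparison is used here as a stand-in.

```python
from typing import Sequence


def exact_match_accuracy(predicted: Sequence[int], reference: Sequence[int]) -> float:
    """Fraction of cases where the predicted ESI level equals the reference level."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("no cases to score")
    hits = sum(p == r for p, r in zip(predicted, reference))
    return hits / len(reference)


def _normalize(label: str) -> str:
    # Hypothetical normalization: trim whitespace and ignore case.
    return label.strip().lower()


def at_least_one_hit_rate(
    predicted: Sequence[Sequence[str]], reference: Sequence[Sequence[str]]
) -> float:
    """Fraction of cases where any predicted diagnosis appears in the reference list."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("no cases to score")
    hits = 0
    for pred_labels, ref_labels in zip(predicted, reference):
        ref_set = {_normalize(r) for r in ref_labels}
        if any(_normalize(p) in ref_set for p in pred_labels):
            hits += 1
    return hits / len(reference)


if __name__ == "__main__":
    # Toy data, not taken from the study.
    esi_pred = [2, 3, 3, 1]
    esi_ref = [2, 3, 4, 1]
    dx_pred = [["Pneumonia", "sepsis"], ["migraine"], ["appendicitis"]]
    dx_ref = [["pneumonia"], ["tension headache"], ["Appendicitis ", "dehydration"]]
    print(exact_match_accuracy(esi_pred, esi_ref))
    print(at_least_one_hit_rate(dx_pred, dx_ref))
```

Exact string matching is strict: synonyms or differently coded diagnoses would count as misses, so a real evaluation would likely need a more tolerant matching step.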

Evaluating large language model workflows in clinical decision support: referral, triage, and diagnosis

Evaluating and Mitigating Limitations of Large Language Models in Clinical Decision Making

Improving Clinical Expertise in Large Language Models Using Electronic Medical Records

Large language models encode clinical knowledge

Towards Accurate Differential Diagnosis with Large Language Models

Large language models improve clinical decision making of medical students through patient simulation and structured feedback: a randomized controlled trial

Evaluating large language models in medical applications: a survey

CliBench: A Multifaceted and Multigranular Evaluation of Large Language Models for Clinical Decision Making

Deciphering Diagnoses: How Large Language Models Explanations Influence Clinical Decision Making

Large Language Models as Agents in the Clinic

Large Language Models Are Poor Clinical Decision-Makers: A Comprehensive Benchmark

Large language models in solving clinical dilemmas - advantages and drawbacks

Leveraging Large Language Models for Decision Support in Personalized Oncology

Large language models for precision oncology: Clinical decision support through expert-guided learning

Large Language Model Influence on Diagnostic Reasoning

Large Language Model Prompting Techniques for Advancement in Clinical Medicine

Large Language Models in Healthcare: A Comprehensive Benchmark

Evaluation of large language model performance on the Biomedical Language Understanding and Reasoning Benchmark

Large Language Models and Medical Knowledge Grounding for Diagnosis Prediction