Evaluating the Impact of Lab Test Results on Large Language Models Generated Differential Diagnoses from Clinical Case Vignettes

Balu Bhasuran, Qiao Jin, Yuzhang Xie, Carl Yang, Karim Hanna, Jennifer Costa, Cindy Shavor, Zhiyong Lu, Zhe He
2024-11-01
Abstract: Differential diagnosis is crucial for medicine as it helps healthcare providers systematically distinguish between conditions that share similar symptoms. This study assesses the impact of lab test results on differential diagnoses (DDx) made by large language models (LLMs). Clinical vignettes from 50 case reports from PubMed Central were created incorporating patient demographics, symptoms, and lab results. Five LLMs (GPT-4, GPT-3.5, Llama-2-70b, Claude-2, and Mixtral-8x7B) were tested to generate Top 10, Top 5, and Top 1 DDx with and without lab data. A comprehensive evaluation involving GPT-4, a knowledge graph, and clinicians was conducted. GPT-4 performed best, achieving 55% accuracy for Top 1 diagnoses and 60% for Top 10 with lab data, with lenient accuracy up to 80%. Lab results significantly improved accuracy, with GPT-4 and Mixtral excelling, though exact match rates were low. Lab tests, including liver function, metabolic/toxicology panels, and serology/immune tests, were generally interpreted correctly by LLMs for differential diagnosis.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper evaluates how laboratory test results affect the accuracy of differential diagnoses (DDx) that large language models (LLMs) generate from clinical case vignettes. The researchers built vignettes containing patient demographics, symptoms, and laboratory test results, then asked five LLMs (GPT-4, GPT-3.5, Llama-2-70b, Claude-2, and Mixtral-8x7B) to produce Top 10, Top 5, and Top 1 DDx with and without the laboratory data. Comparing the two conditions shows how much the lab results contribute to diagnostic accuracy.

The main objectives of the study are:

1. **Evaluating the role of laboratory test results**: Determining whether laboratory test results significantly improve the accuracy of the differential diagnoses that LLMs generate.
2. **Comparing the performance of different models**: Contrasting the five LLMs on DDx generation, in particular the difference in performance with and without laboratory data.
3. **Introducing knowledge graphs and GPT-4 for automatic evaluation**: Using a knowledge graph (Biomedical Knowledge Graph, BKG) and GPT-4 as automatic evaluators to verify the accuracy and relevance of the generated differential diagnoses.
4. **Conducting a comprehensive error analysis**: Analyzing errors in detail to understand the strengths and limitations of LLMs in generating differential diagnoses.

Through these objectives, the study explores how LLMs combined with laboratory test results can improve the diagnostic accuracy of clinical decision-support systems, and thereby patient outcomes and the quality of medical services.
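The Top 1, Top 5, and Top 10 comparison with and without lab data can be illustrated with a minimal sketch. It is not the authors' evaluation pipeline, which relied on GPT-4, a knowledge graph, and clinician review. It assumes a strict match on normalized diagnosis names, and every function name, case identifier, and diagnosis list below is hypothetical toy data rather than a result from the paper.

```python
from typing import Dict, List


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())


def top_k_hit(ranked: List[str], gold: str, k: int) -> bool:
    """True if the gold diagnosis appears among the first k ranked candidates."""
    target = normalize(gold)
    return any(normalize(dx) == target for dx in ranked[:k])


def top_k_accuracy(
    predictions: Dict[str, List[str]], gold: Dict[str, str], k: int
) -> float:
    """Fraction of cases whose gold diagnosis is in the model's Top-k list."""
    if not gold:
        return 0.0
    hits = sum(
        top_k_hit(predictions.get(case_id, []), dx, k)
        for case_id, dx in gold.items()
    )
    return hits / len(gold)


# Hypothetical toy data (NOT from the paper): gold diagnoses and ranked
# model outputs for the same vignettes, without and with lab results.
gold = {
    "case_1": "Acute hepatitis B",
    "case_2": "Diabetic ketoacidosis",
}
without_labs = {
    "case_1": ["Cholecystitis", "Acute hepatitis B", "Gastritis"],
    "case_2": ["Gastroenteritis", "Sepsis", "Pancreatitis"],
}
with_labs = {
    "case_1": ["Acute hepatitis B", "Cholecystitis", "Gastritis"],
    "case_2": ["Sepsis", "Diabetic ketoacidosis", "Pancreatitis"],
}

for k in (1, 5, 10):
    acc_without = top_k_accuracy(without_labs, gold, k)
    acc_with = top_k_accuracy(with_labs, gold, k)
    print(f"Top-{k}: without labs = {acc_without:.2f}, with labs = {acc_with:.2f}")
```

Strict string matching corresponds only loosely to the exact-match setting described in the abstract. A lenient score of the kind the paper reports would need a judge that recognizes synonyms and clinically related diagnoses, which this sketch does not attempt.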