Guidelines For Rigorous Evaluation of Clinical LLMs For Conversational Reasoning

Shreya Johri, Jaehwan Jeong, Benjamin A. Tran, Daniel I. Schlessinger, Shannon Wongvibulsin, Zhuo Ran Cai, Roxana Daneshjou, Pranav Rajpurkar
DOI: https://doi.org/10.1101/2023.09.12.23295399
2024-01-23
Abstract: The integration of Large Language Models (LLMs) like GPT-4 and GPT-3.5 into clinical diagnostics has the potential to transform patient-doctor interactions. However, the readiness of these models for real-world clinical application remains inadequately tested. This paper introduces the Conversational Reasoning Assessment Framework for Testing in Medicine (CRAFT-MD), a novel approach for evaluating clinical LLMs. Unlike traditional methods that rely on structured medical exams, CRAFT-MD focuses on natural dialogues, using simulated AI agents to interact with LLMs in a controlled, ethical environment. We applied CRAFT-MD to assess the diagnostic capabilities of GPT-4 and GPT-3.5 in the context of skin diseases. Our experiments revealed critical insights into the limitations of current LLMs in terms of clinical conversational reasoning, history taking, and diagnostic accuracy. Based on these findings, we propose a comprehensive set of guidelines for future evaluations of clinical LLMs. These guidelines emphasize realistic doctor-patient conversations, comprehensive history taking, open-ended questioning, and a combination of automated and expert evaluations. The introduction of CRAFT-MD marks a significant advancement in LLM testing, aiming to ensure that these models augment medical practice effectively and ethically.
Dermatology
What problem does this paper attempt to address?
The paper addresses the inadequacy of current evaluation methods for large language models (LLMs) in clinical diagnosis. Traditional evaluations rely mainly on structured medical exams, such as multiple-choice questions, which cannot comprehensively assess how LLMs perform in natural conversations, particularly in history taking, reasoning, and making diagnoses. The paper therefore proposes a new evaluation framework, CRAFT-MD (Conversational Reasoning Assessment Framework for Testing in Medicine), which assesses the diagnostic capabilities of clinical LLMs more realistically by simulating natural conversations between doctors and patients.

### Main Issues:

1. **Limitations of Traditional Evaluation Methods**: Structured exams such as multiple-choice questions do not capture the conversational skills that diagnosis depends on, especially history taking, reasoning, and making diagnoses.
2. **Challenges in Practical Application of Clinical LLMs**: Despite the significant potential of LLMs in the medical field, their readiness for real clinical scenarios has not been fully tested, particularly in handling complex patient conversations and integrating information.
3. **Ethical and Safety Issues**: Early application of LLMs in actual patient interactions may pose ethical and safety risks, so rigorous evaluation in controlled environments is needed first.

### Solution:

- **CRAFT-MD Framework**: By simulating natural conversations between doctors and patients, the framework evaluates the performance of LLMs in history taking, reasoning, and making diagnoses. The framework includes the following key components:
  - **Patient AI Agent**: Simulates patients and engages in natural conversations with LLMs based on detailed case descriptions.
  - **Scoring AI Agent**: Assesses the diagnostic accuracy of LLMs by comparing their output with the actual case descriptions.
  - **Human Expert Evaluation**: Checks whether LLMs collect medical histories comprehensively and evaluates their performance in conversations.

### Research Findings:

- **Decline in Diagnostic Accuracy**: In multi-turn conversations, the diagnostic accuracy of LLMs is lower than with static case descriptions and multiple-choice questions.
- **Multi-turn Conversations Did Not Improve Accuracy**: Multi-turn conversations did not improve diagnostic accuracy as expected, and instead exposed the limitations of LLMs in conversational reasoning.
- **Conversation Summarization Technique**: Summarizing multi-turn conversations into concise case descriptions significantly improved the diagnostic accuracy of GPT-3.5.
- **Importance of Physical Examination**: Removing physical examination details significantly reduced the diagnostic accuracy of LLMs, highlighting the importance of face-to-face clinical assessments or visual examinations in telemedicine.

### Guideline Recommendations:

1. **Evaluate Diagnostic Accuracy Through Realistic Doctor-Patient Conversations**: Evaluation should cover dynamic, complex real-world medical conversations, not just static exams.
2. **Assess Comprehensive History Taking and Information Gathering Abilities**: Traditional evaluation methods often overlook history taking and information gathering, which are crucial for accurate diagnosis and effective treatment.
3. **Combine Automated and Expert Evaluation**: Use a combination of automated evaluation and human expert assessment to verify the performance of LLMs in complex clinical scenarios.

Through these methods, the paper aims to ensure that clinical LLMs can effectively assist doctors in practical applications rather than becoming a burden.
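
To make the evaluation loop described above more concrete, here is a minimal Python sketch of a conversational evaluation harness. It is an illustration under stated assumptions, not the paper's implementation: every name (`Case`, `run_conversation`, `grade`, `evaluate`, the scripted agents) is hypothetical, the doctor and patient are scripted stand-ins rather than LLM-backed agents, the "Final diagnosis:" marker is an invented convention, and the scoring agent is replaced by a simple case-insensitive string match.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Turn = Tuple[str, str]                 # (speaker, text)
Agent = Callable[[List[Turn]], str]    # maps the dialogue so far to the next utterance

FINAL_PREFIX = "final diagnosis:"      # hypothetical marker that ends the dialogue


@dataclass
class Case:
    vignette: str    # detailed case description given to the patient agent
    diagnosis: str   # ground-truth diagnosis used for grading


def run_conversation(doctor: Agent, patient: Agent, max_turns: int = 10) -> List[Turn]:
    """Alternate doctor and patient turns until the doctor commits to a diagnosis."""
    history: List[Turn] = []
    for _ in range(max_turns):
        doctor_msg = doctor(history)
        history.append(("doctor", doctor_msg))
        if doctor_msg.lower().startswith(FINAL_PREFIX):
            break
        history.append(("patient", patient(history)))
    return history


def extract_diagnosis(history: List[Turn]) -> Optional[str]:
    """Return the doctor's final diagnosis, or None if none was given."""
    for speaker, text in reversed(history):
        if speaker == "doctor" and text.lower().startswith(FINAL_PREFIX):
            return text.split(":", 1)[1].strip()
    return None


def grade(predicted: Optional[str], case: Case) -> bool:
    """Stand-in for the scoring agent: exact match, ignoring case."""
    return predicted is not None and predicted.lower() == case.diagnosis.lower()


def evaluate(cases: List[Case],
             make_doctor: Callable[[Case], Agent],
             make_patient: Callable[[Case], Agent]) -> float:
    """Fraction of cases in which the conversation ends in the correct diagnosis."""
    correct = 0
    for case in cases:
        history = run_conversation(make_doctor(case), make_patient(case))
        if grade(extract_diagnosis(history), case):
            correct += 1
    return correct / len(cases)


# --- Scripted stand-ins so the sketch runs without any model access ---

def make_scripted_patient(case: Case) -> Agent:
    def patient(history: List[Turn]) -> str:
        # A real patient agent would answer only what was asked, using the vignette.
        return f"Here is what I have noticed: {case.vignette}"
    return patient


def make_scripted_doctor(guess: str, n_questions: int = 2) -> Agent:
    def doctor(history: List[Turn]) -> str:
        asked = sum(1 for speaker, _ in history if speaker == "doctor")
        if asked < n_questions:
            return f"Question {asked + 1}: can you tell me more about your symptoms?"
        return f"Final diagnosis: {guess}"
    return doctor


cases = [
    Case("itchy, silvery plaques on both elbows", "psoriasis"),
    Case("ring-shaped scaly rash on the forearm", "tinea corporis"),
]
# A doctor that always answers "psoriasis" is right on one of the two cases.
score = evaluate(cases, lambda case: make_scripted_doctor("psoriasis"), make_scripted_patient)
print(score)  # 0.5
```

Keeping the agents as plain callables means an LLM-backed doctor or patient could be swapped in without changing the loop, and the exact-match `grade` function could be replaced by a model-based or human grader, in line with the summary's recommendation to combine automated and expert evaluation.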