Abstract: The integration of Large Language Models (LLMs) like GPT-4 and GPT-3.5 into clinical diagnostics has the potential to transform patient-doctor interactions. However, the readiness of these models for real-world clinical application remains inadequately tested. This paper introduces the Conversational Reasoning Assessment Framework for Testing in Medicine (CRAFT-MD), a novel approach for evaluating clinical LLMs. Unlike traditional methods that rely on structured medical exams, CRAFT-MD focuses on natural dialogues, using simulated AI agents to interact with LLMs in a controlled, ethical environment. We applied CRAFT-MD to assess the diagnostic capabilities of GPT-4 and GPT-3.5 in the context of skin diseases. Our experiments revealed critical insights into the limitations of current LLMs in terms of clinical conversational reasoning, history taking, and diagnostic accuracy. Based on these findings, we propose a comprehensive set of guidelines for future evaluations of clinical LLMs. These guidelines emphasize realistic doctor-patient conversations, comprehensive history taking, open-ended questioning, and a combination of automated and expert evaluations. The introduction of CRAFT-MD marks a significant advancement in LLM testing, aiming to ensure that these models augment medical practice effectively and ethically.

What problem does this paper attempt to address?

The paper attempts to address the issue of the current evaluation methods for large language models (LLMs) in clinical diagnosis being inadequate. Specifically, traditional evaluation methods mainly rely on structured medical exams, such as multiple-choice questions, which cannot comprehensively assess the performance of LLMs in natural conversations, particularly in aspects like history taking, reasoning, and making diagnoses. Therefore, the paper proposes a new evaluation framework, CRAFT-MD (Conversational Reasoning Assessment Framework for Testing in Medicine), aimed at more realistically assessing the diagnostic capabilities of clinical LLMs by simulating natural conversations between doctors and patients.

### Main Issues:

1. **Limitations of Traditional Evaluation Methods**: Existing evaluation methods mainly rely on structured medical exams, such as multiple-choice questions, which cannot comprehensively assess the performance of LLMs in natural conversations, especially in aspects like history taking, reasoning, and making diagnoses.
2. **Challenges in Practical Application of Clinical LLMs**: Despite the significant potential of LLMs in the medical field, their readiness for real clinical scenarios has not been fully tested, particularly in handling complex patient conversations and information integration.
3. **Ethical and Safety Issues**: Early application of LLMs in actual patient interactions may pose ethical and safety risks, necessitating rigorous evaluation in controlled environments.

### Solution:

- **CRAFT-MD Framework**: By simulating natural conversations between doctors and patients, the framework evaluates the performance of LLMs in history taking, reasoning, and making diagnoses. The framework includes the following key components:
  - **Patient AI Agent**: Simulates patients and engages in natural conversations with LLMs based on detailed case descriptions.
  - **Scoring AI Agent**: Assesses the diagnostic accuracy of LLMs and compares it with the actual case descriptions.
  - **Human Expert Evaluation**: Ensures that LLMs can comprehensively collect medical histories and evaluates their performance in conversations.

### Research Findings:

- **Decline in Diagnostic Accuracy**: In multi-turn conversations, the diagnostic accuracy of LLMs is lower than in static case descriptions and multiple-choice questions.
- **Multi-turn Conversations Did Not Improve Accuracy**: Multi-turn conversations did not improve diagnostic accuracy as expected, instead exposing the limitations of LLMs in conversational reasoning.
- **Conversation Summarization Technique**: Summarizing multi-turn conversations into concise case descriptions significantly improved the diagnostic accuracy of GPT-3.5.
- **Importance of Physical Examination**: Removing physical examination details significantly reduced the diagnostic accuracy of LLMs, highlighting the importance of face-to-face clinical assessments or visual examinations in telemedicine.

### Guideline Recommendations:

1. **Evaluate Diagnostic Accuracy Through Realistic Doctor-Patient Conversations**: Dynamic and complex real-world medical conversations require more evaluation, not just static exams.
2. **Assess Comprehensive History Taking and Information Gathering Abilities**: Traditional evaluation methods often overlook the importance of history taking and information gathering, which are crucial for accurate diagnosis and effective treatment.
3. **Combine Automated and Expert Evaluation**: Use a combination of automated evaluation and human expert assessment to ensure the performance of LLMs in complex clinical scenarios.

Through these methods, the paper aims to ensure that clinical LLMs can effectively assist doctors in practical applications rather than becoming a burden.
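
The interaction between the clinical LLM, the patient AI agent, and the scoring AI agent described above can be illustrated with a small, self-contained sketch. This is not the paper's implementation: every name here (`Case`, `run_conversation`, `evaluate`, the `toy_*` functions, the `Final diagnosis:` marker) is hypothetical, the agents are rule-based stand-ins rather than real LLM calls, and the two cases are invented toy data with no clinical validity. It only shows the general shape of a multi-turn evaluation loop followed by automated grading.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Turn = Tuple[str, str]            # (speaker, text)
PREFIX = "Final diagnosis:"       # assumed marker that ends the conversation


@dataclass
class Case:
    vignette: Dict[str, str]      # facts the patient agent may reveal, keyed by topic
    diagnosis: str                # ground-truth label used by the grader


def run_conversation(clinical_llm: Callable[[List[Turn]], str],
                     patient_agent: Callable[[Case, str], str],
                     case: Case,
                     max_turns: int = 10) -> Tuple[List[Turn], str]:
    """Alternate doctor and patient turns until a diagnosis is stated."""
    transcript: List[Turn] = []
    for _ in range(max_turns):
        message = clinical_llm(transcript)
        transcript.append(("doctor", message))
        if message.startswith(PREFIX):
            return transcript, message[len(PREFIX):].strip()
        transcript.append(("patient", patient_agent(case, message)))
    return transcript, ""         # no diagnosis reached within max_turns


def evaluate(cases, clinical_llm, patient_agent, grader_agent) -> float:
    """Fraction of cases in which the grader accepts the stated diagnosis."""
    correct = 0
    for case in cases:
        _, predicted = run_conversation(clinical_llm, patient_agent, case)
        correct += bool(grader_agent(predicted, case))
    return correct / len(cases)


# --- Toy stand-ins for the three agents (assumptions, not real models) ---

def toy_patient_agent(case: Case, question: str) -> str:
    # Reveal only the fact that matches the topic asked about.
    q = question.lower()
    for topic, fact in case.vignette.items():
        if topic in q:
            return fact
    return "I'm not sure."


def toy_clinical_llm(transcript: List[Turn]) -> str:
    questions = ["How long have you had the rash? (duration)",
                 "Is there any itch?"]
    asked = sum(1 for speaker, _ in transcript if speaker == "doctor")
    if asked < len(questions):
        return questions[asked]
    answers = " ".join(t for s, t in transcript if s == "patient").lower()
    # Arbitrary toy rule, not clinical guidance.
    guess = "eczema" if "itchy" in answers else "psoriasis"
    return f"{PREFIX} {guess}"


def toy_grader_agent(predicted: str, case: Case) -> bool:
    # A real grader would judge synonyms; this one only compares strings.
    return predicted.strip().lower() == case.diagnosis.strip().lower()


CASES = [
    Case({"duration": "About two weeks.", "itch": "Yes, it is very itchy."}, "Eczema"),
    Case({"duration": "Several months.", "itch": "No, it does not itch."}, "Tinea corporis"),
]

if __name__ == "__main__":
    print(evaluate(CASES, toy_clinical_llm, toy_patient_agent, toy_grader_agent))
```

Passing the agents in as plain callables keeps the loop independent of any particular model API, so the same `run_conversation` could in principle drive a real model, and the returned transcript could be handed to human experts for the history-taking review that the automated grader does not cover.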

Guidelines For Rigorous Evaluation of Clinical LLMs For Conversational Reasoning

Testing the Limits of Language Models: A Conversational Framework for Medical AI Assessment

MEDIC: Towards a Comprehensive Framework for Evaluating LLMs in Clinical Applications

Evaluation of GPT-3.5 and GPT-4 for supporting real-world information needs in healthcare delivery

Comparative Evaluation of LLMs in Clinical Oncology

PALLM: Evaluating and Enhancing PALLiative Care Conversations with Large Language Models

Evaluation of large language model performance on the Biomedical Language Understanding and Reasoning Benchmark

Towards Automatic Evaluation for LLMs' Clinical Capabilities: Metric, Data, and Algorithm

Evaluation of General Large Language Models in Contextually Assessing Semantic Concepts Extracted from Adult Critical Care Electronic Health Record Notes

Systematic analysis of ChatGPT, Google search and Llama 2 for clinical decision support tasks

Integrating human expertise & automated methods for a dynamic and multi-parametric evaluation of large language models' feasibility in clinical decision-making

A Survey of Clinicians’ Views of the Utility of Large Language Models

Large Language Model Augmented Clinical Trial Screening

Large language models encode clinical knowledge

Performance of large language models at the MRCS Part A: a tool for medical education?

Fine-tuning Large Language Model (LLM) Artificial Intelligence Chatbots in Ophthalmology and LLM-based evaluation using GPT-4

A Novel Nuanced Conversation Evaluation Framework for Large Language Models in Mental Health

Clinical Camel: An Open Expert-Level Medical Language Model with Dialogue-Based Knowledge Encoding

Evaluating large language models as agents in the clinic

Evaluation of large language models in breast cancer clinical scenarios: A comparative analysis based on ChatGPT-3.5, ChatGPT-4.0, and Claude2

Evaluating the Impact of a Specialized LLM on Physician Experience in Clinical Decision Support: A Comparison of Ask Avo and ChatGPT-4