Abstract: Evaluating large language models (LLMs) in clinical scenarios is crucial to assessing their potential clinical utility. Existing benchmarks rely heavily on static question-answering, which does not accurately depict the complex, sequential nature of clinical decision-making. Here, we introduce AgentClinic, a multimodal agent benchmark for evaluating LLMs in simulated clinical environments that include patient interactions, multimodal data collection under incomplete information, and the usage of various tools, resulting in an in-depth evaluation across nine medical specialties and seven languages. We find that solving MedQA problems in the sequential decision-making format of AgentClinic is considerably more challenging, resulting in diagnostic accuracies that can drop to below a tenth of the original accuracy. Overall, we observe that agents sourced from Claude-3.5 outperform other LLM backbones in most settings. Nevertheless, we see stark differences in the LLMs' ability to make use of tools, such as experiential learning, adaptive retrieval, and reflection cycles. Strikingly, Llama-3 shows relative improvements of up to 92% with the notebook tool, which allows for writing and editing notes that persist across cases. To further scrutinize our clinical simulations, we leverage real-world electronic health records, perform a clinical reader study, perturb agents with biases, and explore novel patient-centric metrics that this interactive environment enables for the first time.

What problem does this paper attempt to address?

The paper addresses how to evaluate the capabilities of large language models (LLMs) in a clinical environment. Existing benchmarks rely mainly on static question-answering, which cannot accurately reflect the complexity and sequential nature of clinical decision-making. The paper introduces AgentClinic, a new multimodal agent benchmark for evaluating LLMs in a simulated clinical environment. By simulating patient interactions, multimodal data collection, and tool use, AgentClinic enables a more in-depth evaluation of LLM performance across nine medical specialties and seven languages. Specifically, the paper targets the following key problems:

1. **The gap between static question-answering and actual clinical decision-making**: Existing evaluation methods are mainly based on static question-answering, which cannot reflect how doctors make decisions under uncertainty, limited information, and resource constraints in clinical practice. AgentClinic simulates complex clinical tasks more realistically by introducing an interactive, conversation-driven sequential decision-making environment.
2. **The use of multimodal data and tools**: In clinical work, doctors need to collect multiple types of data (such as body temperature, blood pressure, and electrocardiograms) and use various tools (such as medical image interpretation and electronic health records). AgentClinic evaluates LLM performance in clinical settings by simulating the collection of this multimodal data and the use of tools.
3. **Applicability across languages and specialties**: The clinical environment involves multiple languages and professional fields. AgentClinic evaluates LLM performance across languages and specialties by providing cases in nine medical specialties and seven languages.
4. **The influence of cognitive and implicit biases**: In the clinical environment, doctors and patients may be affected by various cognitive and implicit biases. AgentClinic introduces these biases to evaluate their impact on the diagnostic accuracy of LLMs, and studies patients' trust in biased doctors and their compliance with medical advice.
5. **Patients' perception and satisfaction**: In addition to diagnostic accuracy, patients' perception and satisfaction are important indicators for evaluating clinical systems. AgentClinic evaluates patients' trust in doctors, treatment compliance, and willingness to consult again by simulating conversations between patients and doctors.

By addressing these problems, AgentClinic aims to provide a more comprehensive and realistic benchmark for evaluating and improving the application of LLMs in the clinical environment.
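The difference between static question-answering and the interactive, sequential format described above can be illustrated with a toy episode loop. This is a minimal sketch and not AgentClinic's actual code: the names `Case`, `ScriptedDoctor`, and `run_episode`, the action vocabulary (`ask`, `test`, `diagnose`), and the example case are all hypothetical, and a scripted plan stands in for an LLM doctor agent.

```python
from dataclasses import dataclass, field


@dataclass
class Case:
    """Hypothetical simulated case; the doctor never reads these fields directly."""
    symptoms: dict       # question keyword -> patient's answer
    test_results: dict   # test name -> result text
    diagnosis: str       # ground-truth label, used only for scoring


@dataclass
class ScriptedDoctor:
    """Stand-in for an LLM doctor agent: follows a fixed plan of actions."""
    plan: list                                   # list of (action, argument) tuples
    notes: list = field(default_factory=list)    # observations gathered so far

    def act(self, turn):
        # Fall back to an uninformed diagnosis once the plan is exhausted.
        return self.plan[turn] if turn < len(self.plan) else ("diagnose", "unknown")

    def observe(self, observation):
        self.notes.append(observation)


def run_episode(case, doctor, max_turns=5):
    """Run one dialogue; return True if the final diagnosis matches the label."""
    for turn in range(max_turns):
        action, arg = doctor.act(turn)
        if action == "ask":      # question to the simulated patient
            doctor.observe(case.symptoms.get(arg, "I'm not sure."))
        elif action == "test":   # request to the simulated measurement source
            doctor.observe(case.test_results.get(arg, "Result unavailable."))
        elif action == "diagnose":
            return arg.lower() == case.diagnosis.lower()
    return False  # turn budget exhausted before committing to a diagnosis


# Toy case for illustration only; not taken from the benchmark.
case = Case(
    symptoms={"pain": "Sharp chest pain that gets worse when I breathe in."},
    test_results={"ecg": "Diffuse ST elevation."},
    diagnosis="pericarditis",
)
doctor = ScriptedDoctor(plan=[("ask", "pain"), ("test", "ecg"), ("diagnose", "Pericarditis")])
print(run_episode(case, doctor))
```

In a static question-answering setup the model would receive the symptoms and test results up front. Here the doctor has to decide what to ask and which test to request within a limited number of turns, so an episode can fail simply because the relevant information was never gathered.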

AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments
