Abstract: Recent developments in large language models (LLMs) have unlocked new opportunities for healthcare, from information synthesis to clinical decision support. These new LLMs are not just capable of modeling language, but can also act as intelligent "agents" that interact with stakeholders in open-ended conversations and even influence clinical decision-making. Rather than relying on benchmarks that measure a model's ability to process clinical data or answer standardized test questions, LLM agents should be assessed for their performance on real-world clinical tasks. These new evaluation frameworks, which we call "Artificial-intelligence Structured Clinical Examinations" ("AI-SCI"), can draw from comparable technologies where machines operate with varying degrees of self-governance, such as self-driving cars. High-fidelity simulations may also be used to evaluate interactions between users and LLMs within a clinical workflow, or to model the dynamic interactions of multiple LLMs. Developing these robust, real-world clinical evaluations will be crucial towards deploying LLM agents into healthcare.

What problem does this paper attempt to address?

The paper addresses how to evaluate large language models (LLMs) in clinical healthcare, especially when these models act as intelligent "agents" that take part in actual clinical tasks. Traditional evaluations rely mainly on standardized test questions or clinical data processing, and they cannot comprehensively measure how an LLM performs in a real-world clinical environment. The authors therefore propose a new evaluation framework, the "Artificial-intelligence Structured Clinical Examination" (AI-SCI), which assesses the practicality and safety of LLMs in clinical workflows through high-fidelity simulations and agent-based modeling. Specifically, the paper focuses on the following aspects:

1. **Limitations of traditional evaluation methods**: Existing benchmarks such as MedQA and MedNLI rely mainly on standardized test questions or clinical texts, which cannot fully reflect the performance of LLMs in actual clinical environments.
2. **New evaluation framework**: The AI-SCI framework is proposed to evaluate the performance of LLMs on clinical tasks through high-fidelity simulations. These simulations can include interactions with doctors, patients, and nursing staff, as well as dynamic interactions between multiple LLMs.
3. **Application of agent-based modeling**: Agent-based modeling (ABM) is used to simulate the behavior of LLMs in different clinical scenarios, evaluate their impact on clinical decision-making, and identify potential safety issues.
4. **Practical application cases**: The paper discusses the use of LLMs in actual healthcare systems, such as UC San Diego Health integrating GPT-4 into MyChart to streamline patient messaging.

Through these methods, the paper aims to provide more comprehensive and reliable evaluation tools for LLMs in clinical healthcare, helping to ensure that these technologies are safe and effective when deployed.
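To make the idea of a simulation-based evaluation concrete, the following is a minimal sketch in Python, not the authors' method. It assumes a toy patient-messaging triage task: scripted scenarios carry a ground-truth urgency label, and an agent is scored on task-level outcomes (missed urgent messages, unnecessary escalations) rather than on exam-style questions. The names `PatientMessage`, `keyword_agent`, `run_simulation`, and `SCENARIOS` are hypothetical, and `keyword_agent` is a rule-based stand-in for where a real harness would call an LLM.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PatientMessage:
    text: str
    urgent: bool  # ground-truth label assigned by the scenario author (assumption)


# An agent is any callable that maps a message to "escalate" or "routine".
Agent = Callable[[str], str]


def keyword_agent(text: str) -> str:
    """Hypothetical stand-in for an LLM agent; a real harness would query a model here."""
    red_flags = ("chest pain", "shortness of breath", "bleeding")
    lowered = text.lower()
    return "escalate" if any(flag in lowered for flag in red_flags) else "routine"


def run_simulation(agent: Agent, scenarios: List[PatientMessage]) -> Dict[str, int]:
    """Replay each scripted scenario through the agent and tally task-level outcomes."""
    missed_urgent = 0
    unnecessary_escalations = 0
    for msg in scenarios:
        decision = agent(msg.text)
        if msg.urgent and decision != "escalate":
            missed_urgent += 1  # safety-relevant failure
        elif not msg.urgent and decision == "escalate":
            unnecessary_escalations += 1  # workflow burden on clinicians
    return {
        "total": len(scenarios),
        "missed_urgent": missed_urgent,
        "unnecessary_escalations": unnecessary_escalations,
    }


# Illustrative scenarios only; these are not real patient data.
SCENARIOS = [
    PatientMessage("I have had chest pain since this morning.", urgent=True),
    PatientMessage("Can I get a refill of my allergy medication?", urgent=False),
    PatientMessage("I feel dizzy and my vision is blurry.", urgent=True),
]

report = run_simulation(keyword_agent, SCENARIOS)
print(report)
```

Because the agent is passed in as a plain callable, the same scenarios could be replayed against different models or prompts, and the tallies separate safety failures from workflow costs. A fuller agent-based simulation would add simulated clinicians and patients that respond to the agent's output, which this sketch omits.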

Large Language Models as Agents in the Clinic

Evaluating large language models as agents in the clinic

Large Language Models Illuminate a Progressive Pathway to Artificial Intelligent Healthcare Assistant

Large Language Model Prompting Techniques for Advancement in Clinical Medicine

Large language models encode clinical knowledge

Large language models in medicine: the potentials and pitfalls

Improving Clinical Expertise in Large Language Models Using Electronic Medical Records

Automatic Interactive Evaluation for Large Language Models with State Aware Patient Simulator

Large Language Models in the Medical Field: Principles and Applications

AI Hospital: Interactive Evaluation and Collaboration of LLMs As Intern Doctors for Clinical Diagnosis

Adaptive Reasoning and Acting in Medical Language Agents

Large Language Models in Healthcare: A Comprehensive Benchmark

AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator

Large Language Models Illuminate a Progressive Pathway to Artificial Healthcare Assistant: A Review

Demystifying Large Language Models for Medicine: A Primer

Large Language Models Are Poor Clinical Decision-Makers: A Comprehensive Benchmark

Understanding and training for the impact of large language models and artificial intelligence in healthcare practice: a narrative review

Clinical Insights: A Comprehensive Review of Language Models in Medicine

Evaluation and mitigation of the limitations of large language models in clinical decision-making

Large language models in medical and healthcare fields: applications, advances, and challenges