Abstract: Recent developments in large language models (LLMs) have unlocked opportunities for healthcare, from information synthesis to clinical decision support. These LLMs are not just capable of modeling language, but can also act as intelligent "agents" that interact with stakeholders in open-ended conversations and even influence clinical decision-making. Rather than relying on benchmarks that measure a model's ability to process clinical data or answer standardized test questions, LLM agents can be modeled in high-fidelity simulations of clinical settings and should be assessed for their impact on clinical workflows. These evaluation frameworks, which we refer to as "Artificial Intelligence Structured Clinical Examinations" ("AI-SCE"), can draw from comparable technologies where machines operate with varying degrees of self-governance, such as self-driving cars, in dynamic environments with multiple stakeholders. Developing these robust, real-world clinical evaluations will be crucial towards deploying LLM agents in medical settings.

What problem does this paper attempt to address?

The paper explores how large language models (LLMs) can be applied in medicine and how they should be evaluated. Specifically, it addresses the following key issues:

1. **Role of LLMs in Clinical Settings**: LLMs can do more than process natural language. They can act as intelligent "agents" that perform complex multi-step reasoning, interact with tools, databases, and other agents, and so respond better to user requests.
2. **Methods for Developing LLM Agents**: The paper explores building LLM agents by giving LLMs access to information sources and tools such as clinical guidelines and electronic health record databases, so that the agents can support daily administrative tasks and clinical decision support.
3. **Evaluating the Effectiveness and Safety of LLM Agents**: The paper proposes a simulation framework based on Agent-Based Modeling (ABM), called Artificial Intelligence Structured Clinical Examinations (AI-SCE), to evaluate LLM agents in realistic clinical scenarios. The framework can simulate patient-doctor interactions and hospital processes, making it possible to assess how LLM agents interact with users, how they use data and tools, and where errors are likely to occur.
4. **Establishing Evaluation Standards**: AI-SCE draws on the Objective Structured Clinical Examination (OSCE) model from medical education to evaluate LLM agents comprehensively in real-world clinical workflows. The evaluation standards should cover both the agents' final outputs and their intermediate steps, including reasoning, tool usage, data management, and interactions with other agents or external users.

By addressing these issues, the paper aims to advance the development of LLMs in medicine and to ensure that they are safe and effective in practice.
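Points 3 and 4 above can be made concrete with a small sketch of one simulated "station": a scripted patient, an agent, a tool, and a scorer that inspects both the final output and the intermediate steps. This is only an illustration under stated assumptions, not the paper's implementation. Every name here (`ScriptedPatient`, `stub_agent`, `lookup_guideline`, `score_station`) is hypothetical, the agent is a rule-based stand-in for an LLM, and the "guideline" is toy data, not clinical guidance.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    kind: str     # "ask", "tool", or "answer"
    content: str


@dataclass
class Trace:
    """Records every intermediate step so it can be evaluated later."""
    steps: list = field(default_factory=list)

    def log(self, kind, content):
        self.steps.append(Step(kind, content))


class ScriptedPatient:
    """Hypothetical simulated patient that answers from a fixed script."""

    def __init__(self, script):
        self.script = script

    def reply(self, question):
        return self.script.get(question, "I'm not sure.")


def lookup_guideline(symptom):
    """Hypothetical tool; the mapping below is toy data, not clinical guidance."""
    toy_guidelines = {"chest pain": "order ECG"}
    return toy_guidelines.get(symptom, "no guidance found")


def stub_agent(patient, trace):
    """Rule-based stand-in for an LLM agent (a real agent would call a model)."""
    question = "What brings you in today?"
    trace.log("ask", question)
    symptom = patient.reply(question)
    trace.log("tool", f"lookup_guideline({symptom!r})")
    advice = lookup_guideline(symptom)
    trace.log("answer", advice)
    return advice


def score_station(trace, expected_answer, required_tool):
    """Score the final output and the intermediate steps separately."""
    final = trace.steps[-1] if trace.steps else None
    return {
        "answer_correct": final is not None
        and final.kind == "answer"
        and final.content == expected_answer,
        "used_required_tool": any(
            s.kind == "tool" and s.content.startswith(required_tool)
            for s in trace.steps
        ),
        "questions_asked": sum(s.kind == "ask" for s in trace.steps),
    }


def run_station():
    """Run one simulated encounter and return its scores."""
    patient = ScriptedPatient({"What brings you in today?": "chest pain"})
    trace = Trace()
    stub_agent(patient, trace)
    return score_station(
        trace, expected_answer="order ECG", required_tool="lookup_guideline"
    )


if __name__ == "__main__":
    print(run_station())
```

Keeping the trace separate from the agent is the design choice that matters here: the scorer can check tool usage and questioning behaviour without trusting the agent's own account of what it did, which mirrors the summary's emphasis on evaluating intermediate steps as well as final outputs.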

Evaluating large language models as agents in the clinic
