Evaluating large language models as agents in the clinic

Nikita Mehandru, Brenda Y. Miao, Eduardo Rodriguez Almaraz, Madhumita Sushil, Atul J. Butte, Ahmed Alaa
DOI: https://doi.org/10.1038/s41746-024-01083-y
IF: 15.2
2024-04-04
npj Digital Medicine
Abstract: Recent developments in large language models (LLMs) have unlocked opportunities for healthcare, from information synthesis to clinical decision support. These LLMs are not just capable of modeling language, but can also act as intelligent "agents" that interact with stakeholders in open-ended conversations and even influence clinical decision-making. Rather than relying on benchmarks that measure a model's ability to process clinical data or answer standardized test questions, LLM agents can be modeled in high-fidelity simulations of clinical settings and should be assessed for their impact on clinical workflows. These evaluation frameworks, which we refer to as "Artificial Intelligence Structured Clinical Examinations" ("AI-SCE"), can draw from comparable technologies where machines operate with varying degrees of self-governance, such as self-driving cars, in dynamic environments with multiple stakeholders. Developing these robust, real-world clinical evaluations will be crucial towards deploying LLM agents in medical settings.
health care sciences & services, medical informatics
What problem does this paper attempt to address?
The paper primarily explores the application of large language models (LLMs) in the medical field and how they should be evaluated. Specifically, it addresses the following key issues:

1. **Role of LLMs in Clinical Settings**: The paper discusses how LLMs can not only handle natural language but also act as intelligent "agents" that perform complex multi-step reasoning tasks and interact with tools, databases, and other agents to better respond to user requests.
2. **Methods for Developing LLM Agents**: The paper explores how giving LLMs access to different information sources and tools (such as clinical guidelines and electronic health record databases) can yield agents that handle daily administrative tasks and provide clinical decision support in clinical settings.
3. **Evaluating the Effectiveness and Safety of LLM Agents**: The paper proposes a simulation framework based on Agent-Based Modeling (ABM), called the Artificial Intelligence Structured Clinical Examination (AI-SCE), to evaluate the performance of LLM agents in realistic clinical scenarios. The framework can simulate interactions between patients and doctors as well as hospital processes, which makes it possible to assess how LLM agents interact with users and use data or tools, and to identify potential points of failure.
4. **Establishing Evaluation Standards**: The AI-SCE framework draws on the Objective Structured Clinical Examination (OSCE) model used in medical education to comprehensively evaluate the performance of LLM agents in real-world clinical workflows. The evaluation standards should cover both the final outputs of LLM agents and their intermediate steps, including their reasoning process, tool usage, data management, and interactions with other agents or external users.

By addressing these issues, the paper aims to advance the development of LLMs in the medical field and ensure their safety and effectiveness in practical applications.
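To make the idea of scoring intermediate steps as well as final outputs more concrete, the sketch below shows one way a single simulated "station" might be checked. It is an illustrative toy, not the paper's framework: the `Station`, `Trace`, and `score_station` names, the tool names (`ehr_lookup`, `guideline_search`, `place_order`), and the keyword-based answer check are all assumptions made for this example, and `stub_agent` is a scripted stand-in for a real LLM agent.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    kind: str    # "reasoning" or "tool_call" (hypothetical categories)
    detail: str  # free text for reasoning, tool name for tool calls


@dataclass
class Trace:
    """Everything the agent did during one simulated encounter."""
    steps: list[Step] = field(default_factory=list)
    final_answer: str = ""


@dataclass
class Station:
    """One simulated scenario, loosely analogous to an OSCE station."""
    prompt: str
    required_tools: set[str]
    forbidden_tools: set[str]
    expected_keyword: str  # crude proxy for a clinician-written rubric item


def score_station(station: Station, trace: Trace) -> dict[str, bool]:
    """Check the intermediate steps and the final output against the rubric."""
    used = {s.detail for s in trace.steps if s.kind == "tool_call"}
    return {
        "used_required_tools": station.required_tools <= used,
        "avoided_forbidden_tools": not (station.forbidden_tools & used),
        "showed_reasoning": any(s.kind == "reasoning" for s in trace.steps),
        "final_answer_ok": station.expected_keyword.lower()
        in trace.final_answer.lower(),
    }


def stub_agent(prompt: str) -> Trace:
    """Scripted stand-in for an LLM agent; a real agent would generate this."""
    return Trace(
        steps=[
            Step("reasoning", "Need the current medication list before advising."),
            Step("tool_call", "ehr_lookup"),
            Step("tool_call", "guideline_search"),
        ],
        final_answer="Flag a possible interaction and refer to a pharmacist.",
    )


station = Station(
    prompt="Patient asks whether a new prescription is safe with current medications.",
    required_tools={"ehr_lookup", "guideline_search"},
    forbidden_tools={"place_order"},
    expected_keyword="pharmacist",
)
trace = stub_agent(station.prompt)
report = score_station(station, trace)

for criterion, passed in report.items():
    print(f"{criterion}: {'pass' and passed and 'pass' or 'fail'}")
```

Keeping the full trace separate from the final answer is the design choice that matters here: it lets a rubric penalize an agent that reaches an acceptable answer through an unsafe step, such as calling a tool it should not have used. A realistic evaluation would replace the keyword check with clinician-defined criteria and run many stations with multiple simulated stakeholders.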