Large Language Models as Agents in the Clinic

Nikita Mehandru,Brenda Y. Miao,Eduardo Rodriguez Almaraz,Madhumita Sushil,Atul J. Butte,Ahmed Alaa
2023-09-20
Abstract:Recent developments in large language models (LLMs) have unlocked new opportunities for healthcare, from information synthesis to clinical decision support. These new LLMs are not just capable of modeling language, but can also act as intelligent "agents" that interact with stakeholders in open-ended conversations and even influence clinical decision-making. Rather than relying on benchmarks that measure a model's ability to process clinical data or answer standardized test questions, LLM agents should be assessed for their performance on real-world clinical tasks. These new evaluation frameworks, which we call "Artificial-intelligence Structured Clinical Examinations" ("AI-SCI"), can draw from comparable technologies where machines operate with varying degrees of self-governance, such as self-driving cars. High-fidelity simulations may also be used to evaluate interactions between users and LLMs within a clinical workflow, or to model the dynamic interactions of multiple LLMs. Developing these robust, real-world clinical evaluations will be crucial towards deploying LLM agents into healthcare.
Human-Computer Interaction,Multiagent Systems
What problem does this paper attempt to address?
The paper attempts to address the issue of how to evaluate the effectiveness of large language models (LLMs) in clinical healthcare, especially when these models act as intelligent "agents" participating in actual clinical tasks. Traditional evaluation methods are mainly based on standard test questions or clinical data processing capabilities, but these methods cannot comprehensively measure the performance of LLMs in real-world clinical environments. Therefore, the authors propose a new evaluation framework—"Artificial Intelligence Structured Clinical Examination" (AI-SCI), which assesses the practicality and safety of LLMs in clinical workflows through high-fidelity simulations and multi-agent modeling. Specifically, the paper focuses on the following aspects: 1. **Limitations of traditional evaluation methods**: Existing evaluation methods such as MedQA and MedNLI mainly rely on standardized test questions or clinical texts, which cannot fully reflect the performance of LLMs in actual clinical environments. 2. **New evaluation framework**: The AI-SCI framework is proposed to evaluate the performance of LLMs in clinical tasks through high-fidelity simulations and multi-agent modeling. These simulations can include interactions with doctors, patients, and nursing staff, as well as dynamic interactions between multiple LLMs. 3. **Application of multi-agent modeling**: Utilizing multi-agent modeling (ABM) technology to simulate the behavior of LLMs in different clinical scenarios, evaluate their impact on clinical decision-making, and identify potential safety issues. 4. **Practical application cases**: Discussing the application of LLMs in actual healthcare systems, such as UC San Diego Health integrating GPT-4 into MyChart to streamline patient messaging. Through these methods, the paper aims to provide more comprehensive and reliable evaluation tools for the application of LLMs in clinical healthcare, thereby ensuring the safety and effectiveness of these technologies in actual deployment.