Large language models management of complex medication regimens: a case-based evaluation

Amoreena Most,Aaron Chase,Steven Xu,Tanner Hedrick,Brian Murray,Kelli Keats,Susan Smith,Erin Barreto,Tianming Liu,Andrea Sikora
DOI: https://doi.org/10.1101/2024.07.03.24309889
2024-07-08
Abstract:Background: Large language models (LLMs) have shown capability in diagnosing complex medical cases and passing medical licensing exams, but to date, only limited evaluations have studied how LLMs interpret, analyze, and optimize complex medication regimens. The purpose of this evaluation was to test four LLMs ability to identify medication errors and appropriate medication interventions on complex patient cases from the intensive care unit (ICU). Methods: A series of eight patient cases were developed by critical care pharmacists including history of present illness, laboratory values, vital signs, and medication regimens. Then, four LLMs (ChatGPT (GPT-3.5), ChatGPT (GPT-4), Claude2, and Llama2-7b) were prompted to develop a medication regimen for the patient. LLM generated medication regimens were then reviewed by a panel of seven critical care pharmacists to assess for presence of medication errors and clinical relevance. For each medication regimen recommended by the LLM, clinicians were asked to assess for if they would continue a medication, identify perceived medication errors in the medications recommended, identify the presence of life-threatening medication choices, and rank overall agreement on a 5-point Likert scale. Results: The clinician panel rated to continue therapies recommended by the LLMs between 55.8-67.9% of the time. Clinicians perceived between 1.57-4.29 medication errors per recommended regimen, and life-threatening recommendations were present between 15.0-55.3% of the time. Level agreement was between 1.85-2.67 for the four LLMs. Conclusions: LLMs demonstrated potential to serve as clinical decision support for the management of complex medication regimens with further domain specific training; however, caution should be used when employing LLMs for medication management given the present capabilities.
Pharmacology and Therapeutics
What problem does this paper attempt to address?
The problem that this paper attempts to solve is to evaluate the capabilities of large language models (LLMs) in managing complex drug treatment regimens. Specifically, the research objective was to test the ability of four LLMs (ChatGPT (GPT - 3.5), ChatGPT (GPT - 4), Claude2, and Llama2 - 7b) to identify drug errors and appropriate drug interventions in complex patient cases in the intensive care unit (ICU). ### Research Background Large language models (LLMs) have demonstrated their capabilities in diagnosing complex medical cases and passing medical license examinations, etc. However, to date, research on how LLMs interpret, analyze, and optimize complex drug treatment regimens is still limited. Therefore, this study aims to fill this gap and evaluate the performance of LLMs in handling complex drug treatment regimens. ### Research Methods 1. **Data Sources**: The research team developed eight patient cases designed by intensive - care pharmacists. Each case includes the present medical history, laboratory values, vital signs, and the current drug treatment regimen. 2. **Research Design**: - **Two - step Prompt Process**: 1. **Initial Example Prompt**: Provide a detailed patient case, including the complete medical history, the current treatment plan, and the "ground - truth" drug treatment plan, to guide LLMs to understand the expected output format and clinical reasoning requirements. 2. **New Patient Scenario Prompt**: Provide new patient cases and require LLMs to generate updated drug treatment plans based on the provided information. - **Review Process**: A review panel consisting of seven intensive - care pharmacists evaluated the drug treatment regimens generated by each LLM, including whether to continue using the recommended drugs, identifying drug errors, assessing whether there are life - threatening suggestions, and scoring the overall level of agreement (a Likert scale of 1 - 5). ### Main Results - **Drug Continuation Rate**: The continuation rate of GPT - 3.5 was 59.4%, that of GPT - 4 was 67.5%, that of Llama2 - 7b was 56.4%, and that of Claude - 2 was 55.6%. - **Drug Errors**: There were an average of 1.57 to 4.29 drug errors in each LLM - recommended drug treatment regimen. - **Life - threatening Suggestions**: The life - threatening suggestion rate of GPT - 3.5 was 38.8%, that of GPT - 4 was 12.2%, that of Llama2 - 7b was 22.4%, and that of Claude - 2 was 46.9%. - **Overall Agreement**: The score of GPT - 3.5 was 2.20, that of GPT - 4 was 2.67, that of Llama2 - 7b was 2.03, and that of Claude - 2 was 1.85. ### Conclusions Although LLMs show certain potential in managing complex drug treatment regimens, they still need to be used with caution at present. The study found that LLMs will provide life - threatening drug suggestions in some cases, indicating that further domain - specific training and evaluation are required before applying them to clinical decision - support. ### Keywords Large Language Models; Artificial Intelligence; Pharmacy; Complexity of Drug Treatment Regimens