Abstract: Background: Large language models (LLMs) have shown capability in diagnosing complex medical cases and passing medical licensing exams, but to date, only limited evaluations have studied how LLMs interpret, analyze, and optimize complex medication regimens. The purpose of this evaluation was to test four LLMs' ability to identify medication errors and appropriate medication interventions on complex patient cases from the intensive care unit (ICU). Methods: A series of eight patient cases were developed by critical care pharmacists, including history of present illness, laboratory values, vital signs, and medication regimens. Then, four LLMs (ChatGPT (GPT-3.5), ChatGPT (GPT-4), Claude2, and Llama2-7b) were prompted to develop a medication regimen for the patient. LLM-generated medication regimens were then reviewed by a panel of seven critical care pharmacists to assess for the presence of medication errors and clinical relevance. For each medication regimen recommended by the LLM, clinicians were asked to assess whether they would continue a medication, identify perceived medication errors in the medications recommended, identify the presence of life-threatening medication choices, and rank overall agreement on a 5-point Likert scale. Results: The clinician panel rated to continue therapies recommended by the LLMs between 55.8% and 67.9% of the time. Clinicians perceived between 1.57 and 4.29 medication errors per recommended regimen, and life-threatening recommendations were present between 15.0% and 55.3% of the time. Level of agreement was between 1.85 and 2.67 for the four LLMs. Conclusions: LLMs demonstrated potential to serve as clinical decision support for the management of complex medication regimens with further domain-specific training; however, caution should be used when employing LLMs for medication management given the present capabilities.

What problem does this paper attempt to address?

This paper evaluates how well large language models (LLMs) manage complex medication regimens. Specifically, it tests the ability of four LLMs (ChatGPT (GPT-3.5), ChatGPT (GPT-4), Claude2, and Llama2-7b) to identify medication errors and appropriate medication interventions in complex patient cases from the intensive care unit (ICU).

### Research Background

LLMs have shown capability in diagnosing complex medical cases and passing medical licensing exams. However, to date, only limited evaluations have studied how LLMs interpret, analyze, and optimize complex medication regimens. This study aims to fill that gap by evaluating LLM performance on complex medication regimens.

### Research Methods

1. **Data Sources**: Critical care pharmacists developed eight patient cases. Each case includes the history of present illness, laboratory values, vital signs, and the current medication regimen.
2. **Research Design**:
   - **Two-step Prompt Process**:
     1. **Initial Example Prompt**: Provide a detailed patient case, including the complete medical history, the current treatment plan, and the "ground-truth" medication regimen, to show the LLM the expected output format and clinical reasoning.
     2. **New Patient Scenario Prompt**: Provide a new patient case and ask the LLM to generate an updated medication regimen based on the information provided.
   - **Review Process**: A panel of seven critical care pharmacists evaluated the medication regimen generated by each LLM. For each regimen, they indicated whether they would continue the recommended medications, identified perceived medication errors, flagged life-threatening recommendations, and rated overall agreement on a 5-point Likert scale.
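The two-step prompt process and the panel review described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class names, field names, prompt wording, and the way each metric is computed (for example, continuation rate as continued medications divided by recommended medications) are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class PatientCase:
    """Hypothetical container for one ICU case; field names are assumptions."""
    history: str
    labs: str
    vitals: str
    current_regimen: str


def format_case(case: PatientCase) -> str:
    """Render a case as plain text for inclusion in a prompt."""
    return (
        f"History of present illness: {case.history}\n"
        f"Laboratory values: {case.labs}\n"
        f"Vital signs: {case.vitals}\n"
        f"Current medication regimen: {case.current_regimen}"
    )


def build_two_step_prompts(
    example: PatientCase, ground_truth: str, new_case: PatientCase
) -> list[str]:
    """Return the two prompts in order (wording is illustrative only)."""
    # Step 1: a worked example that includes the ground-truth regimen.
    step_one = (
        "Example case:\n"
        f"{format_case(example)}\n"
        f"Ground-truth medication regimen: {ground_truth}"
    )
    # Step 2: a new case with no answer; the model must propose a regimen.
    step_two = (
        "New case:\n"
        f"{format_case(new_case)}\n"
        "Provide an updated medication regimen for this patient."
    )
    return [step_one, step_two]


@dataclass
class Review:
    """One pharmacist's review of one LLM regimen (assumed schema)."""
    continued: int          # recommended medications the reviewer would continue
    recommended: int        # medications the LLM recommended
    errors: int             # perceived medication errors
    life_threatening: bool  # any life-threatening recommendation present
    agreement: int          # overall agreement, 1-5 Likert


def summarize_reviews(reviews: list[Review]) -> dict[str, float]:
    """Aggregate reviews for one model; metric definitions are assumptions."""
    n = len(reviews)
    total_recommended = sum(r.recommended for r in reviews)
    return {
        "continuation_rate_pct": 100 * sum(r.continued for r in reviews) / total_recommended,
        "errors_per_regimen": sum(r.errors for r in reviews) / n,
        "life_threatening_pct": 100 * sum(1 for r in reviews if r.life_threatening) / n,
        "mean_agreement": sum(r.agreement for r in reviews) / n,
    }
```

In practice, the two prompts would be sent to each model in sequence within one conversation, and `summarize_reviews` would be called once per model over all reviewer-by-case records.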
### Main Results

- **Drug Continuation Rate**: Reviewers would continue the recommended therapies 59.4% of the time for GPT-3.5, 67.5% for GPT-4, 56.4% for Llama2-7b, and 55.6% for Claude2.
- **Drug Errors**: Reviewers perceived an average of 1.57 to 4.29 medication errors per LLM-recommended regimen.
- **Life-threatening Suggestions**: Life-threatening recommendations were present 38.8% of the time for GPT-3.5, 12.2% for GPT-4, 22.4% for Llama2-7b, and 46.9% for Claude2.
- **Overall Agreement**: The mean agreement score was 2.20 for GPT-3.5, 2.67 for GPT-4, 2.03 for Llama2-7b, and 1.85 for Claude2.

### Conclusions

LLMs show potential for managing complex medication regimens, but they should currently be used with caution. The models gave life-threatening medication recommendations in some cases, which indicates that further domain-specific training and evaluation are required before they are applied to clinical decision support.

### Keywords

Large Language Models; Artificial Intelligence; Pharmacy; Complexity of Drug Treatment Regimens

Large language models management of complex medication regimens: a case-based evaluation

Unlocking the potential of advanced large language models in medication review and reconciliation: A proof-of-concept investigation

Evaluating Accuracy and Reproducibility of Large Language Model Performance in Pharmacy Education

Large language models in solving clinical dilemmas - advantages and drawbacks

Improving Clinical Expertise in Large Language Models Using Electronic Medical Records

Critical Care Studies Using Large Language Models Based on Electronic Healthcare Records: A Technical Note

Large language models encode clinical knowledge

Evaluation of large language models as a diagnostic aid for complex medical cases

Large language models for preventing medication direction errors in online pharmacies

Evaluating the use of large language models to provide clinical recommendations in the Emergency Department

The potential and pitfalls of using a large language model such as ChatGPT, GPT-4, or LLaMA as a clinical assistant.

Evaluating and Mitigating Limitations of Large Language Models in Clinical Decision Making

Evaluation of General Large Language Models in Contextually Assessing Semantic Concepts Extracted from Adult Critical Care Electronic Health Record Notes

Evaluation and mitigation of the limitations of large language models in clinical decision-making

Evaluation of large language model performance on the Biomedical Language Understanding and Reasoning Benchmark

Humans Continue to Outperform Large Language Models in Complex Clinical Decision-Making: A Study with Medical Calculators

Integrating human expertise & automated methods for a dynamic and multi-parametric evaluation of large language models' feasibility in clinical decision-making

Learning to Make Rare and Complex Diagnoses With Generative AI Assistance: Qualitative Study of Popular Large Language Models

Systematic Review of Large Language Models for Patient Care: Current Applications and Challenges

Proactive Polypharmacy Management Using Large Language Models: Opportunities to Enhance Geriatric Care

A comparison of the diagnostic ability of large language models in challenging clinical cases