Evaluating Accuracy and Reproducibility of Large Language Model Performance in Pharmacy Education

Amoreena Most, Mengxuan Hu, Huibo Yang, Tianming Liu, Xianyan Chen, Sheng Li, Steven Xu, Zhengliang Liu, Andrea Sikora
DOI: https://doi.org/10.1101/2024.03.21.24304667
2024-03-24
Abstract: The purpose of this study was to compare the performance of ChatGPT (GPT-3.5), ChatGPT (GPT-4), Claude2, Llama2-7b, and Llama2-13b on 219 multiple-choice questions focusing on critical care pharmacotherapy. To further assess whether prompt engineering can improve the reasoning abilities and performance of LLMs, we examined responses with a zero-shot Chain-of-Thought (CoT) approach, CoT prompting, and a custom-built GPT (PharmacyGPT). A set of 219 multiple-choice questions focused on critical care pharmacotherapy topics used in Doctor of Pharmacy curricula from two accredited colleges of pharmacy was compiled for this study. A total of five LLMs were evaluated: ChatGPT (GPT-3.5), ChatGPT (GPT-4), Claude2, Llama2-7b, and Llama2-13b. The primary outcome was response accuracy. Of the five LLMs tested, GPT-4 showed the highest average accuracy rate at 71.6%. A larger variance indicates lower consistency and reduced confidence in a model's answers. Llama2-13b had the lowest variance (0.070) of all the LLMs, but achieved an accuracy of only 41.5%. Following analysis of overall accuracy, performance on knowledge- vs. skill-based questions was assessed. All five LLMs demonstrated higher accuracy on knowledge-based questions compared to skill-based questions. GPT-4 had the highest accuracy for knowledge- and skill-based questions, with an accuracy of 87% and 67%, respectively. Response accuracy from LLMs in the domain of clinical pharmacy can be improved by using prompt engineering techniques.
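The abstract reports an average accuracy and a variance for each model but does not say how the variance was computed. The following is a minimal sketch of one plausible reading, assuming each model answered the full question set several times: accuracy is averaged over runs, and the variance of each question's correctness across runs is averaged over questions. The function name `summarize_runs`, the data layout, and this variance definition are illustrative assumptions, not the authors' method.

```python
from statistics import mean, pvariance


def summarize_runs(runs, answer_key):
    """Summarize repeated runs of one model on a multiple-choice question set.

    runs: list of runs; each run is a list of chosen option letters,
          one per question (hypothetical layout, assumed for illustration).
    answer_key: list of correct option letters, one per question.
    """
    # Score every answer as 1.0 (correct) or 0.0 (incorrect).
    scores = [
        [1.0 if choice == key else 0.0 for choice, key in zip(run, answer_key)]
        for run in runs
    ]
    # Accuracy of each run, then the average over runs.
    run_accuracies = [mean(run_scores) for run_scores in scores]
    # Assumed consistency measure: population variance of each question's
    # correctness across runs, averaged over all questions.
    per_question_variance = [pvariance(column) for column in zip(*scores)]
    return {
        "mean_accuracy": mean(run_accuracies),
        "mean_question_variance": mean(per_question_variance),
    }


if __name__ == "__main__":
    key = ["A", "B", "C", "D"]
    runs = [["A", "B", "C", "A"], ["A", "B", "D", "A"]]
    print(summarize_runs(runs, key))
```

Under this definition, a model that gives the same answer to every question on every run has a variance of zero whether or not those answers are correct, which is consistent with a model being highly consistent yet inaccurate.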
Medical Education
What problem does this paper attempt to address?
This paper examines the accuracy and reproducibility of large language models (LLMs) in pharmacy education. The study compares five models, namely ChatGPT (GPT-3.5 and GPT-4), Claude2, Llama2-7b, and Llama2-13b, on 219 multiple-choice questions about critical care pharmacotherapy. It also evaluates whether the reasoning ability and performance of LLMs can be improved with a zero-shot Chain-of-Thought (CoT) approach, CoT prompting, and a custom-built GPT (PharmacyGPT). GPT-4 had the highest average accuracy of all models at 71.6%, although its answers were less consistent than those of Llama2-13b. Llama2-13b had the highest consistency (lowest variance), but its accuracy was only 41.5%. All models performed better on knowledge-based questions than on skill-based questions; GPT-4 led on both, with accuracy rates of 87% and 67%, respectively. Prompt engineering techniques can improve the response accuracy of LLMs in clinical pharmacy. For instance, CoT prompting can raise the accuracy of GPT-4, especially on knowledge-based questions, where it surpasses the average performance of pharmacy students. However, LLMs still perform worse than students on skill-based questions that require complex reasoning. The paper emphasizes that, despite the potential LLMs show in handling structured diagnostic problems, their reasoning ability in clinical pharmacy decision-making still needs improvement. Further research is required to optimize prompting strategies and strengthen the clinical pharmacy reasoning of LLMs.
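To illustrate what zero-shot CoT prompting on a multiple-choice question can look like, here is a minimal sketch. It assumes the commonly used trigger phrase "Let's think step by step." and a requested "Answer: <letter>" line; the exact prompts used in the study are not given here, so the prompt wording, the function names `build_zero_shot_cot_prompt` and `extract_answer`, and the placeholder question are all hypothetical.

```python
import re

# Widely used zero-shot CoT trigger phrase; assumed here, not taken from the study.
COT_TRIGGER = "Let's think step by step."


def build_zero_shot_cot_prompt(stem, options):
    """Format a multiple-choice question and append the CoT trigger.

    stem: the question text.
    options: dict mapping option letters to option text.
    """
    lines = [f"Question: {stem}"]
    for label, text in options.items():
        lines.append(f"{label}. {text}")
    # Ask for a fixed final line so the chosen option can be parsed reliably.
    lines.append("Finish with a line of the form 'Answer: <letter>'.")
    lines.append(COT_TRIGGER)
    return "\n".join(lines)


def extract_answer(response):
    """Return the last option letter stated as 'Answer: X', or None if absent."""
    matches = re.findall(r"Answer:\s*([A-E])\b", response)
    return matches[-1] if matches else None


if __name__ == "__main__":
    # Placeholder question, not from the study's question set.
    prompt = build_zero_shot_cot_prompt(
        "Which option is listed second?",
        {"A": "First option", "B": "Second option"},
    )
    print(prompt)
    print(extract_answer("The second entry is B.\nAnswer: B"))
```

Taking the last "Answer:" match rather than the first is a deliberate choice: a step-by-step response may mention candidate answers while reasoning before committing to a final one.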