Abstract:
Background: The availability of increasingly powerful large language models (LLMs) has attracted substantial interest in their potential for interpreting and generating human-like text for biomedical and clinical applications. However, adopting LLMs for specific use cases often raises demands for high accuracy, concerns about balancing generalizability and domain specificity, and questions about prompting robustness. There is also no established framework or method for choosing which LLMs (or prompting strategies) should be adopted for specific biomedical or clinical tasks.
Objective: To address this uncertainty about applying LLMs to biomedical applications, this study aims to (1) propose a framework to comprehensively evaluate and compare the performance of a range of LLMs and prompting techniques on a suite of biomedical natural language processing (NLP) tasks and (2) use the framework to benchmark several general-purpose LLMs and biomedical domain-specific LLMs.
Methods: We evaluated and compared six general-purpose LLMs (GPT-4, GPT-3.5-Turbo, Flan-T5-XXL, Llama-3-8B-Instruct, Yi-1.5-34B-Chat, and Zephyr-7B-Beta) and three healthcare-specific LLMs (Medicine-Llama3-8B, Meditron-7B, and MedLLaMA-13B) on a set of 13 datasets, referred to as the Biomedical Language Understanding and Reasoning Benchmark (BLURB), covering six commonly needed medical NLP tasks: named entity recognition (NER); relation extraction (RE); population, interventions, comparators, and outcomes (PICO); sentence similarity (SS); document classification (Class.); and question answering (QA). All models were evaluated without further training or fine-tuning. Model performance was assessed across a range of prompting strategies (formalized as a systematic, reusable prompting framework) using the standard, task-specific evaluation metrics defined by BLURB.
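
As an illustration of how per-dataset metrics can be rolled up into a single benchmark score, the sketch below averages dataset scores within each task and then averages the task means. It assumes a macro-average over task types; the function name `macro_average_score` and all numbers are hypothetical placeholders, not values or code from this study.

```python
from statistics import mean


def macro_average_score(scores_by_task):
    """Average dataset scores within each task, then average the task means.

    scores_by_task maps a task name (e.g. "NER") to a list of per-dataset
    scores on a 0-100 scale.
    """
    task_means = {task: mean(scores) for task, scores in scores_by_task.items()}
    overall = mean(list(task_means.values()))
    return overall, task_means


# Hypothetical placeholder scores for illustration only.
example_scores = {
    "NER": [60.0, 70.0],  # two datasets in this task
    "QA": [80.0],         # one dataset in this task
}

overall, per_task = macro_average_score(example_scores)
print(per_task)
print(overall)
```

Averaging within tasks first keeps a task with many datasets from dominating the overall score.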
Results: Across all tasks, GPT-4 outperformed other LLMs, achieving a score of 64.6 on the benchmark, though other models, such as Flan-T5-XXL and Llama-3-8B-Instruct, demonstrated competitive performance on multiple tasks. We found that general-purpose models achieved better overall scores than domain-specific models, sometimes by significant margins. We observed a substantial impact of strategically editing the prompt describing the task and a consistent improvement in performance when including examples semantically similar to the input text. Additionally, the most performant prompts for nearly half the models outperformed the previously reported best results for the PubMedQA dataset from the BLURB leaderboard.
Conclusions: These results provide evidence of the potential LLMs may have for biomedical applications and highlight the importance of robust evaluation before adopting LLMs for any specific use case. Notably, performant open-source LLMs such as Llama-3-8B-Instruct and Flan-T5-XXL show promise for use cases where trustworthiness and data confidentiality are concerns, as these models can be hosted locally, offering better security, transparency, and explainability. Continued research into how these emerging technologies can be adapted for the healthcare setting, paired with human expertise, and enhanced through quality control measures will be important for responsible innovation with LLMs in the biomedical area.
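
The idea of including examples semantically similar to the input text can be sketched as follows. This is a minimal illustration, not the study's prompting framework: it substitutes bag-of-words cosine similarity for whatever semantic similarity measure the authors used, and the helper names (`select_similar_examples`, `build_prompt`) and example texts are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_similar_examples(query, pool, k=2):
    """Return the k labeled examples whose text is most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(
        pool,
        key=lambda ex: cosine_similarity(q, bag_of_words(ex["text"])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(instruction, query, examples):
    """Assemble a few-shot prompt: instruction, worked examples, then the query."""
    shots = "\n\n".join(f"Input: {ex['text']}\nAnswer: {ex['label']}" for ex in examples)
    return f"{instruction}\n\n{shots}\n\nInput: {query}\nAnswer:"


# Hypothetical labeled pool and query for illustration only.
pool = [
    {"text": "Does aspirin lower stroke risk in adults?", "label": "yes"},
    {"text": "Is metformin linked to weight gain?", "label": "no"},
    {"text": "Does aspirin reduce headache duration?", "label": "maybe"},
]
query = "Does aspirin reduce the risk of stroke?"

selected = select_similar_examples(query, pool, k=2)
prompt = build_prompt("Answer yes, no, or maybe.", query, selected)
print(prompt)
```

In practice, a sentence-embedding model would likely replace the bag-of-words vectors; the selection and prompt-assembly steps would stay the same.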

Evaluation of Large Language Model Performance on the Biomedical Language Understanding and Reasoning Benchmark: Comparative Study