Abstract: Background: The ability of large language models (LLMs) to interpret and generate human-like text has been accompanied by speculation about their application in medicine and clinical research. Limited data are available to inform evidence-based decisions on their appropriateness for specific use cases. Methods: We evaluated and compared four general-purpose LLMs (GPT-4, GPT-3.5-turbo, Flan-T5-XXL, and Zephyr-7B-Beta) and a healthcare-specific LLM (MedLLaMA-13B) on a set of 13 datasets, referred to as the Biomedical Language Understanding and Reasoning Benchmark (BLURB), covering six commonly needed medical natural language processing tasks: named entity recognition (NER); relation extraction; population, interventions, comparators, and outcomes (PICO); sentence similarity; document classification; and question-answering. All models were evaluated without modification. Model performance was assessed across a range of prompting strategies (formalised as a systematic, reusable prompting framework) using the standard, task-specific evaluation metrics defined by BLURB. Results: Across all tasks, GPT-4 outperformed the other LLMs, followed by Flan-T5-XXL and GPT-3.5-turbo, then Zephyr-7B-Beta and MedLLaMA-13B. The most performant prompts for GPT-4 and Flan-T5-XXL both outperformed the previously reported best results for the PubMedQA task. The domain-specific MedLLaMA-13B achieved lower scores on most tasks, with the exception of question-answering tasks. We observed a substantial impact of strategically editing the prompt describing the task and a consistent improvement in performance when including examples semantically similar to the input text in the prompt. Conclusion: These results provide evidence of the potential LLMs may have for medical applications and highlight the importance of robust evaluation before adopting LLMs for any specific use case. Continuing to explore how these emerging technologies can be adapted for the healthcare setting, paired with human expertise, and enhanced through quality control measures will be important research to allow responsible innovation with LLMs in the medical domain.
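
The abstract reports a consistent improvement when the prompt includes examples that are semantically similar to the input text. The abstract does not say how similarity was measured, so the following is only a minimal sketch of the general idea, not the authors' framework. It assumes a bag-of-words cosine similarity as a crude stand-in for a real sentence embedding, and the names `select_similar_examples`, `build_prompt`, and the toy `POOL` are hypothetical and purely illustrative.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; a crude stand-in for a sentence embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_similar_examples(query, pool, k=2):
    """Return the k (text, label) pairs from pool most similar to query."""
    q = bag_of_words(query)
    ranked = sorted(pool, key=lambda ex: cosine(q, bag_of_words(ex[0])), reverse=True)
    return ranked[:k]


def build_prompt(instruction, examples, query):
    """Assemble a few-shot prompt: instruction, worked examples, then the query."""
    parts = [instruction, ""]
    for text, label in examples:
        parts += [f"Input: {text}", f"Output: {label}", ""]
    parts += [f"Input: {query}", "Output:"]
    return "\n".join(parts)


# Hypothetical toy pool for a yes/no question-answering task.
# The labels are illustrative only and are not medical claims.
POOL = [
    ("Does aspirin reduce the risk of stroke?", "yes"),
    ("Is smoking associated with lung cancer?", "yes"),
    ("Does vitamin C cure the common cold?", "no"),
]

query = "Does aspirin reduce the risk of heart attack?"
shots = select_similar_examples(query, POOL, k=1)
prompt = build_prompt("Answer yes or no.", shots, query)
```

In practice the similarity function would likely be swapped for an embedding model, while the selection and prompt-assembly steps stay the same; the resulting `prompt` string is what would be sent to the LLM under evaluation.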

MedMobile: A mobile-sized language model with expert-level clinical capabilities

Large language models encode clinical knowledge

Towards Expert-Level Medical Question Answering with Large Language Models

Clinical Camel: An Open-Source Expert-Level Medical Language Model with Dialogue-Based Knowledge Encoding

Assessing The Potential Of Mid-Sized Language Models For Clinical QA

Small Language Models Learn Enhanced Reasoning Skills from Medical Textbooks

SM70: A Large Language Model for Medical Devices

MedAide: Leveraging Large Language Models for On-Premise Medical Assistance on Edge Devices

MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning

Can large language models reason about medical questions?

Evaluation of large language model performance on the Biomedical Language Understanding and Reasoning Benchmark

MEDIC: Towards a Comprehensive Framework for Evaluating LLMs in Clinical Applications

Emulating Human Cognitive Processes for Expert-Level Medical Question-Answering with Large Language Models

LLMs-in-the-loop Part-1: Expert Small AI Models for Bio-Medical Text Translation

BioMedLM: A 2.7B Parameter Language Model Trained On Biomedical Text

BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices

MedAdapter: Efficient Test-Time Adaptation of Large Language Models towards Medical Reasoning

Med-PMC: Medical Personalized Multi-modal Consultation with a Proactive Ask-First-Observe-Next Paradigm

Testing the Limits of Language Models: A Conversational Framework for Medical AI Assessment

Democratizing MLLMs in Healthcare: TinyLLaVA-Med for Efficient Healthcare Diagnostics in Resource-Constrained Settings