Abstract:

Background: The ability of large language models (LLMs) to interpret and generate human-like text has been accompanied by speculation about their application in medicine and clinical research. Limited data are available to inform evidence-based decisions on their appropriateness for specific use cases.

Methods: We evaluated and compared four general-purpose LLMs (GPT-4, GPT-3.5-turbo, Flan-T5-XXL, and Zephyr-7B-Beta) and one healthcare-specific LLM (MedLLaMA-13B) on a set of 13 datasets, referred to as the Biomedical Language Understanding and Reasoning Benchmark (BLURB), covering six commonly needed medical natural language processing tasks: named entity recognition (NER); relation extraction; population, interventions, comparators, and outcomes (PICO); sentence similarity; document classification; and question-answering. All models were evaluated without modification. Model performance was assessed across a range of prompting strategies (formalised as a systematic, reusable prompting framework) using the standard, task-specific evaluation metrics defined by BLURB.

Results: Across all tasks, GPT-4 outperformed the other LLMs, followed by Flan-T5-XXL and GPT-3.5-turbo, then Zephyr-7B-Beta and MedLLaMA-13B. The most performant prompts for GPT-4 and Flan-T5-XXL both outperformed the previously reported best results for the PubMedQA task. The domain-specific MedLLaMA-13B achieved lower scores on most tasks, with the exception of question-answering tasks. Strategically editing the prompt that describes the task had a substantial impact on performance, and including examples semantically similar to the input text in the prompt consistently improved performance.

Conclusion: These results provide evidence of the potential LLMs may have for medical application and highlight the importance of robust evaluation before adopting LLMs for any specific use case. Continued research into how these emerging technologies can be adapted for the healthcare setting, paired with human expertise, and enhanced through quality control measures will be important for responsible innovation with LLMs in medicine.
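The Results mention including examples semantically similar to the input text in the prompt. The following is a minimal sketch of that general idea, not the prompting framework used in the study. It assumes a bag-of-words cosine similarity as a crude stand-in for whatever semantic similarity measure was actually used, and the names `bow_vector`, `cosine`, `select_examples`, and `build_prompt` are hypothetical.

```python
import math
import re
from collections import Counter


def bow_vector(text):
    """Lower-cased bag-of-words counts (assumed stand-in for a sentence embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_examples(query, pool, k=2):
    """Return the k (text, label) pairs from the pool most similar to the query."""
    q = bow_vector(query)
    ranked = sorted(pool, key=lambda ex: cosine(q, bow_vector(ex[0])), reverse=True)
    return ranked[:k]


def build_prompt(instruction, query, pool, k=2):
    """Assemble a few-shot prompt: instruction, selected examples, then the query."""
    lines = [instruction, ""]
    for text, label in select_examples(query, pool, k):
        lines += [f"Text: {text}", f"Answer: {label}", ""]
    lines += [f"Text: {query}", "Answer:"]
    return "\n".join(lines)
```

In practice the pool would be drawn from a task's training split and the similarity would more plausibly come from a sentence-embedding model; both choices are assumptions here rather than details stated in the abstract.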
