Abstract: Background: The ability of large language models (LLMs) to interpret and generate human-like text has been accompanied by speculation about their application in medicine and clinical research. Limited data are available to inform evidence-based decisions on their appropriateness for specific use cases. Methods: We evaluated and compared four general-purpose LLMs (GPT-4, GPT-3.5-turbo, Flan-T5-XXL, and Zephyr-7B-Beta) and a healthcare-specific LLM (MedLLaMA-13B) on a set of 13 datasets, referred to as the Biomedical Language Understanding and Reasoning Benchmark (BLURB), covering six commonly needed medical natural language processing tasks: named entity recognition (NER); relation extraction; population, interventions, comparators, and outcomes (PICO); sentence similarity; document classification; and question-answering. All models were evaluated without modification. Model performance was assessed across a range of prompting strategies (formalised as a systematic, reusable prompting framework) using the standard, task-specific evaluation metrics defined by BLURB. Results: Across all tasks, GPT-4 outperformed the other LLMs, followed by Flan-T5-XXL and GPT-3.5-turbo, then Zephyr-7B-Beta and MedLLaMA-13B. The most performant prompts for GPT-4 and Flan-T5-XXL both outperformed the previously reported best results for the PubMedQA task. The domain-specific MedLLaMA-13B achieved lower scores on most tasks, with the exception of question-answering tasks. Strategically editing the prompt describing the task had a substantial impact, and including examples semantically similar to the input text in the prompt consistently improved performance. Conclusion: These results provide evidence of the potential LLMs may have for medical applications and highlight the importance of robust evaluation before adopting LLMs for any specific use case. Continuing to explore how these emerging technologies can be adapted for the healthcare setting, paired with human expertise, and enhanced through quality control measures will be important research to allow responsible innovation with LLMs in the medical area.

Evaluating the Potential of Leading Large Language Models in Reasoning Biology Questions

Can large language models reason about medical questions?

Survey on Reasoning Capabilities and Accessibility of Large Language Models Using Biology-related Questions

Exploring the Potential of Large Language Models in Molecular Tasks: An Insightful Evaluation with GPT‐4

An Evaluation of Large Language Models in Bioinformatics Research

Evaluating the strengths and weaknesses of large language models in answering neurophysiology questions

GLoRE: Evaluating Logical Reasoning of Large Language Models

Evaluation of ChatGPT Family of Models for Biomedical Reasoning and Classification

Coupling Large Language Models with Logic Programming for Robust and General Reasoning from Text

Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models' Reasoning Performance

Large Language Model for Science: A Study on P vs. NP

Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond

Reasoning with large language models for medical question answering

Large Language Models in Medical Term Classification and Unexpected Misalignment Between Response and Reasoning

CLR-Bench: Evaluating Large Language Models in College-level Reasoning

Improving accuracy of GPT-3/4 results on biomedical data using a retrieval-augmented language model

Evaluating the ChatGPT family of models for biomedical reasoning and classification

Can Large Language Models do Analytical Reasoning?

Evaluating Large Language Models in Ophthalmology

Evaluation of large language model performance on the Biomedical Language Understanding and Reasoning Benchmark

Large Language Models Are Not Strong Abstract Reasoners