Abstract:

Background: The ability of large language models (LLMs) to interpret and generate human-like text has been accompanied by speculation about their application in medicine and clinical research. Limited data are available to inform evidence-based decisions on their appropriateness for specific use cases.

Methods: We evaluated and compared four general-purpose LLMs (GPT-4, GPT-3.5-turbo, Flan-T5-XXL, and Zephyr-7B-Beta) and a healthcare-specific LLM (MedLLaMA-13B) on a set of 13 datasets, referred to as the Biomedical Language Understanding and Reasoning Benchmark (BLURB), covering six commonly needed medical natural language processing tasks: named entity recognition (NER); relation extraction; population, interventions, comparators, and outcomes (PICO); sentence similarity; document classification; and question-answering. All models were evaluated without modification. Model performance was assessed under a range of prompting strategies (formalised as a systematic, reusable prompting framework) using the standard, task-specific evaluation metrics defined by BLURB.

Results: Across all tasks, GPT-4 outperformed the other LLMs, followed by Flan-T5-XXL and GPT-3.5-turbo, then Zephyr-7B-Beta and MedLLaMA-13B. The most performant prompts for GPT-4 and Flan-T5-XXL both outperformed the previously reported best results for the PubMedQA task. The domain-specific MedLLaMA-13B achieved lower scores on most tasks, except question-answering tasks. We observed a substantial impact of strategically editing the prompt describing the task and a consistent improvement in performance when examples semantically similar to the input text were included in the prompt.

Conclusion: These results provide evidence of the potential LLMs may have for medical application and highlight the importance of robust evaluation before adopting LLMs for any specific use case. Continued research into how these emerging technologies can be adapted to the healthcare setting, paired with human expertise, and enhanced through quality control measures will be important for responsible innovation with LLMs in medicine.
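The abstract reports a consistent improvement when examples semantically similar to the input text are included in the prompt. The sketch below illustrates one way such a selection step could look; it is not the authors' prompting framework. It assumes a toy bag-of-words vector as a stand-in for a real sentence-embedding model, and the function names (`embed`, `cosine`, `select_examples`, `build_prompt`), the example pool, and the prompt layout are all hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_examples(query, pool, k=2):
    """Return the k labelled examples from the pool most similar to the query."""
    q = embed(query)
    ranked = sorted(pool, key=lambda ex: cosine(q, embed(ex["text"])), reverse=True)
    return ranked[:k]


def build_prompt(instruction, query, pool, k=2):
    """Assemble a few-shot prompt: task instruction, selected examples, then the query."""
    parts = [instruction]
    for ex in select_examples(query, pool, k):
        parts.append(f"Input: {ex['text']}\nOutput: {ex['label']}")
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


# Hypothetical labelled pool; the sentences are invented for illustration only.
pool = [
    {"text": "Aspirin reduced the risk of stroke in adults.", "label": "yes"},
    {"text": "The trial enrolled 120 children with asthma.", "label": "maybe"},
    {"text": "Statins did not reduce the risk of dementia.", "label": "no"},
]
query = "Does aspirin reduce the risk of stroke?"
prompt = build_prompt("Answer yes, no, or maybe.", query, pool, k=2)
print(prompt)
```

In practice the `embed` function would be replaced by a sentence-embedding model, and the pool would be drawn from a task's training split so that no test items leak into the prompt.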
