Abstract:

Background: The availability of increasingly powerful large language models (LLMs) has attracted substantial interest in their potential for interpreting and generating human-like text for biomedical and clinical applications. However, adopting LLMs for specific use cases often raises demands for high accuracy, concerns about balancing generalizability and domain specificity, and questions about prompting robustness. There is also no framework or method to help choose which LLMs (or prompting strategies) should be adopted for specific biomedical or clinical tasks.

Objective: To address the speculations on applying LLMs for biomedical applications, this study aims to 1) propose a framework to comprehensively evaluate and compare the performance of a range of LLMs and prompting techniques on a suite of biomedical natural language processing (NLP) tasks; and 2) use the framework to benchmark several general-purpose LLMs and biomedical domain-specific LLMs.

Methods: We evaluated and compared six general-purpose LLMs (GPT-4, GPT-3.5-Turbo, Flan-T5-XXL, Llama-3-8B-Instruct, Yi-1.5-34B-Chat, and Zephyr-7B-Beta) and three healthcare-specific LLMs (Medicine-Llama3-8B, Meditron-7B, and MedLLaMA-13B) on a set of 13 datasets, referred to as the Biomedical Language Understanding and Reasoning Benchmark (BLURB), covering six commonly needed medical natural language processing tasks: named entity recognition (NER); relation extraction (RE); population, interventions, comparators, and outcomes (PICO); sentence similarity (SS); document classification (Class.); and question-answering (QA). All models were evaluated without further training or fine-tuning. Model performance was assessed according to a range of prompting strategies (formalized as a systematic, reusable prompting framework) and relied on the standard, task-specific evaluation metrics defined by BLURB.

Results: Across all tasks, GPT-4 outperformed other LLMs, achieving a score of 64.6 on the benchmark, though other models, such as Flan-T5-XXL and Llama-3-8B-Instruct, demonstrated competitive performance on multiple tasks. We found that general-purpose models achieved better overall scores than domain-specific models, sometimes by significant margins. We observed a substantial impact of strategically editing the prompt describing the task and a consistent improvement in performance when including examples semantically similar to the input text. Additionally, the most performant prompts for nearly half the models outperformed the previously reported best results for the PubMedQA dataset from the BLURB leaderboard.

Conclusions: These results provide evidence of the potential LLMs may have for biomedical applications and highlight the importance of robust evaluation before adopting LLMs for any specific use cases. Notably, performant open-source LLMs such as Llama-3-8B-Instruct and Flan-T5-XXL show promise for use cases where trustworthiness and data confidentiality are concerns, as these models can be hosted locally, offering better security, transparency, and explainability. Continuing to explore how these emerging technologies can be adapted for the healthcare setting, paired with human expertise, and enhanced through quality control measures will be important research to allow responsible innovation with LLMs in the biomedical area.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper aims to address several key issues of large language models (LLMs) in biomedical applications:

1. **Lack of Evaluation Framework**: There is currently a lack of a systematic framework to evaluate and compare the performance of different LLMs and their prompting strategies in biomedical natural language processing (NLP) tasks.
2. **Performance and Applicability**: Explore the performance of different types of LLMs (general-purpose and domain-specific) in biomedical NLP tasks, including named entity recognition (NER), relation extraction (RE), PICO element recognition, sentence similarity (SS), document classification (Class.), and question answering (QA).
3. **Impact of Prompting Strategies**: Investigate the impact of different prompting strategies (such as zero-shot, random few-shot, and semantically similar few-shot) on model performance to determine the optimal prompting method.
4. **Guidance on Model Selection**: Provide guidance on selecting appropriate LLMs or prompting strategies for specific biomedical or clinical tasks.

Through these objectives, the paper hopes to provide a comprehensive evaluation framework for researchers and practitioners in the biomedical field, helping them better understand and utilize LLMs.
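
The three prompting strategies named above (zero-shot, random few-shot, and semantically similar few-shot) differ only in how demonstration examples are chosen before the prompt is assembled. The following is a minimal sketch of that idea, not the paper's implementation. The function names (`select_examples`, `build_prompt`, `bow_cosine`), the prompt layout, and the toy question pool are all hypothetical, and a bag-of-words cosine similarity stands in for whatever embedding-based similarity a real pipeline would use.

```python
import math
import random
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity of bag-of-words vectors (a stand-in for an embedding model)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_examples(strategy, input_text, pool, k=2, seed=0):
    """Choose demonstrations from a pool of (text, label) pairs."""
    if strategy == "zero-shot":
        return []  # no demonstrations at all
    if strategy == "random":
        # Seeded so that repeated runs pick the same examples.
        return random.Random(seed).sample(pool, min(k, len(pool)))
    if strategy == "similar":
        # Rank the pool by similarity to the input and keep the top k.
        ranked = sorted(pool, key=lambda ex: bow_cosine(input_text, ex[0]), reverse=True)
        return ranked[:k]
    raise ValueError(f"unknown strategy: {strategy}")


def build_prompt(task_description, input_text, examples):
    """Assemble task description, demonstrations, and the query (hypothetical layout)."""
    parts = [task_description]
    for text, label in examples:
        parts.append(f"Input: {text}\nAnswer: {label}")
    parts.append(f"Input: {input_text}\nAnswer:")
    return "\n\n".join(parts)


# Toy, made-up pool for illustration only; not taken from any BLURB dataset.
pool = [
    ("Does aspirin reduce the risk of stroke?", "yes"),
    ("Is smoking associated with lung cancer?", "yes"),
    ("Does vitamin C cure the common cold?", "no"),
]
query = "Does aspirin reduce the risk of heart attack?"

for strategy in ("zero-shot", "random", "similar"):
    examples = select_examples(strategy, query, pool, k=1)
    print(f"--- {strategy} ---")
    print(build_prompt("Answer the question with yes or no.", query, examples))
```

Keeping example selection separate from prompt assembly means the same task description can be evaluated under each strategy without other changes, which is the kind of controlled comparison the summary describes.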

Evaluation of Large Language Model Performance on the Biomedical Language Understanding and Reasoning Benchmark: Comparative Study

Evaluation of large language model performance on the Biomedical Language Understanding and Reasoning Benchmark

Benchmarking Large Language Models in Evidence-Based Medicine

Large language models for biomedicine: foundations, opportunities, challenges, and best practices

Biomedical Large Languages Models Seem not to be Superior to Generalist Models on Unseen Medical Data

LLMs are not Zero-Shot Reasoners for Biomedical Information Extraction

How well it works: Benchmarking performance of GPT models on medical natural language processing tasks

A Comprehensive Evaluation of Large Language Models on Benchmark Biomedical Text Processing Tasks

A systematic evaluation of large language models for biomedical natural language processing: benchmarks, baselines, and recommendations

Is larger always better? Evaluating and prompting large language models for non-generative medical tasks

Large language models encode clinical knowledge

Large Language Models in Healthcare: A Comprehensive Benchmark

Comparison of Prompt Engineering and Fine-Tuning Strategies in Large Language Models in the Classification of Clinical Notes

Towards Evaluating and Building Versatile Large Language Models for Medicine

Large Language Model in Medical Information Extraction from Titles and Abstracts with Prompt Engineering Strategies: A Comparative Study of GPT-3.5 and GPT-4

MEDIC: Towards a Comprehensive Framework for Evaluating LLMs in Clinical Applications

Large Language Model Benchmarks in Medical Tasks

Evaluation of General Large Language Models in Contextually Assessing Semantic Concepts Extracted from Adult Critical Care Electronic Health Record Notes

Large Language Models Illuminate a Progressive Pathway to Artificial Healthcare Assistant: A Review

A Survey of Large Language Models in Medicine: Progress, Application, and Challenge