Abstract: Instruction-tuned Large Language Models (LLMs) can perform a wide range of tasks given natural language instructions to do so, but they are sensitive to how such instructions are phrased. This issue is especially concerning in healthcare, as clinicians are unlikely to be experienced prompt engineers and the potential consequences of inaccurate outputs are heightened in this domain. This raises a practical question: How robust are instruction-tuned LLMs to natural variations in the instructions provided for clinical NLP tasks? We collect prompts from medical doctors across a range of tasks and quantify the sensitivity of seven LLMs -- some general, others specialized -- to natural (i.e., non-adversarial) instruction phrasings. We find that performance varies substantially across all models, and that -- perhaps surprisingly -- domain-specific models explicitly trained on clinical data are especially brittle, compared to their general domain counterparts. Further, arbitrary phrasing differences can affect fairness, e.g., valid but distinct instructions for mortality prediction yield a range both in overall performance, and in terms of differences between demographic groups.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper examines how sensitive large language models (LLMs) are to variations in instruction phrasing on clinical tasks. Specifically, the researchers focus on:

1. **Sensitivity to Instruction Phrasing**: Instruction-tuned LLMs are highly sensitive to how instructions are phrased when performing clinical natural language processing (NLP) tasks. This sensitivity can lead to large differences in model performance.
2. **Specificity of the Clinical Domain**: In healthcare, this sensitivity is particularly concerning because clinicians are typically not expert prompt engineers, and inaccurate model outputs could have serious consequences for patient health.
3. **Model Robustness**: The researchers evaluated the robustness of seven LLMs (both general-domain models and models trained specifically on clinical data) to natural (non-adversarial) variations in instruction phrasing, using instructions collected from medical doctors across a range of clinical tasks.
4. **Fairness Issues**: The study also investigates how variations in instruction phrasing affect the fairness of model predictions, i.e., performance differences across demographic groups.

### Main Findings

1. **Performance Differences**: All models showed substantial performance differences when given different but semantically equivalent instructions. In classification tasks, the difference could reach 0.6 AUROC points in absolute terms; in information extraction tasks, it could reach 0.4 F1 points in absolute terms.
2. **Performance of Clinical Models**: Although clinical models performed well on certain tasks, general-domain models often performed better overall. Notably, the performance of clinical models declined sharply in the worst-case scenarios.
3. **Impact on Fairness**: Variations in instruction phrasing affect not only overall model performance but also performance disparities across demographic groups. For example, in mortality prediction, the performance difference between white and non-white patients could reach 0.35 AUROC points in absolute terms, and the difference between male and female patients could reach 0.19 AUROC points in absolute terms.

### Conclusions and Recommendations

1. **Cautious Use**: The researchers recommend caution when using instruction-tuned LLMs for high-risk clinical tasks, as even minor variations in instruction phrasing can lead to markedly different outputs.
2. **Raising Awareness**: Clinicians should be aware that seemingly innocuous variations in instruction phrasing can disproportionately affect specific demographic groups.
3. **Future Research Directions**: This work improves our understanding of LLM robustness on clinical tasks and aims to inspire new methods for improving it.

### Limitations

1. **Scope of Models**: The study focuses primarily on open-source LLMs, so the findings may not fully generalize to commercial models.
2. **Sample Representativeness**: Although the researchers made efforts to recruit a diverse group of medical professionals, the final participant sample may still not be representative of all potential users of these technologies.
3. **Visibility of Results**: Participants could not see the model's outputs while writing instructions, which might have affected the quality and diversity of the instructions.
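The two kinds of figures quoted under Main Findings (the AUROC spread across instruction phrasings, and the AUROC gap between demographic groups) can be illustrated with a minimal sketch. This is not the paper's evaluation code: the function names, the toy labels, scores, and group tags are all hypothetical, and the definitions used here (best-minus-worst AUROC over prompts, and the AUROC difference between two groups under a single prompt) are one plausible reading of the summary rather than the authors' exact protocol.

```python
from typing import Dict, Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the probability that a random positive outscores a
    random negative, counting ties as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def prompt_spread(
    labels: Sequence[int], scores_by_prompt: Dict[str, Sequence[float]]
) -> float:
    """Best-minus-worst AUROC over a set of instruction phrasings."""
    values = [auroc(labels, s) for s in scores_by_prompt.values()]
    return max(values) - min(values)


def subgroup_gap(
    labels: Sequence[int],
    scores: Sequence[float],
    groups: Sequence[str],
    group_a: str,
    group_b: str,
) -> float:
    """AUROC on group_a minus AUROC on group_b for a single phrasing."""

    def group_auroc(g: str) -> float:
        idx = [i for i, x in enumerate(groups) if x == g]
        return auroc([labels[i] for i in idx], [scores[i] for i in idx])

    return group_auroc(group_a) - group_auroc(group_b)


# Hypothetical toy data: 8 patients, 2 groups, 2 instruction phrasings.
labels = [1, 0, 1, 0, 1, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
scores_by_prompt = {
    # Ranks every positive above every negative.
    "prompt_1": [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4],
    # Same scores for group "a", but inverted ranking within group "b".
    "prompt_2": [0.9, 0.1, 0.8, 0.2, 0.3, 0.7, 0.4, 0.6],
}

spread = prompt_spread(labels, scores_by_prompt)
gap = subgroup_gap(labels, scores_by_prompt["prompt_2"], groups, "a", "b")
print(f"AUROC spread across prompts: {spread:.2f}")
print(f"AUROC gap (a minus b) under prompt_2: {gap:.2f}")
```

In this toy setup the two phrasings differ only in how they rank patients in group "b", so a modest change in overall AUROC hides a much larger change for one group. That is the pattern the summary describes for mortality prediction, though the numbers here are invented for illustration.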

Open (Clinical) LLMs are Sensitive to Instruction Phrasings
