Open (Clinical) LLMs are Sensitive to Instruction Phrasings

Alberto Mario Ceballos Arroyo, Monica Munnangi, Jiuding Sun, Karen Y.C. Zhang, Denis Jered McInerney, Byron C. Wallace, Silvio Amir
2024-07-13
Abstract: Instruction-tuned Large Language Models (LLMs) can perform a wide range of tasks given natural language instructions to do so, but they are sensitive to how such instructions are phrased. This issue is especially concerning in healthcare, as clinicians are unlikely to be experienced prompt engineers and the potential consequences of inaccurate outputs are heightened in this domain. This raises a practical question: How robust are instruction-tuned LLMs to natural variations in the instructions provided for clinical NLP tasks? We collect prompts from medical doctors across a range of tasks and quantify the sensitivity of seven LLMs -- some general, others specialized -- to natural (i.e., non-adversarial) instruction phrasings. We find that performance varies substantially across all models, and that -- perhaps surprisingly -- domain-specific models explicitly trained on clinical data are especially brittle, compared to their general domain counterparts. Further, arbitrary phrasing differences can affect fairness, e.g., valid but distinct instructions for mortality prediction yield a range both in overall performance, and in terms of differences between demographic groups.
Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper aims to explore the sensitivity of large language models (LLMs) to variations in instruction phrasing in clinical tasks. Specifically, the researchers focus on:

1. **Sensitivity to Instruction Phrasing**: Instruction-tuned large language models (LLMs) are highly sensitive to different phrasings of instructions when performing clinical natural language processing (NLP) tasks. This sensitivity can lead to significant differences in model performance.
2. **Specificity of the Clinical Domain**: In the healthcare domain, this sensitivity is particularly concerning because clinical practitioners are typically not expert prompt engineers, and inaccuracies in model output could have serious implications for patient health.
3. **Model Robustness**: The researchers evaluated the robustness of seven different LLMs (including general models and models specifically trained on clinical data) to natural (non-adversarial) variations in instruction phrasing by collecting instructions from different medical tasks.
4. **Fairness Issues**: The study also investigates how variations in instruction phrasing affect the fairness of model predictions, i.e., performance differences across different demographic groups.

### Main Findings

1. **Performance Differences**: The study found that all models exhibited significant performance differences when faced with different but semantically equivalent instructions. Specifically, in classification tasks, performance differences could reach 0.6 absolute AUROC points, and in information extraction tasks, they could reach 0.4 absolute F1 points.
2. **Performance of Clinical Models**: Although clinical-specific models performed well on certain tasks, general models often performed better overall on other tasks. Notably, the performance of clinical models significantly declined in the worst-case scenarios.
3. **Impact on Fairness**: Variations in instruction phrasing not only affect overall model performance but also lead to performance disparities across different demographic groups. For example, in mortality prediction tasks, the performance difference between white and non-white patients could reach 0.35 absolute AUROC points, and the difference between male and female patients could reach 0.19 absolute AUROC points.

### Conclusions and Recommendations

1. **Cautious Use**: Researchers recommend exercising caution when using instruction-tuned LLMs in high-risk clinical tasks, as even minor variations in instruction phrasing can lead to significantly different output results.
2. **Raising Awareness**: Clinical practitioners should be aware that seemingly innocuous variations in instruction phrasing can disproportionately affect specific demographic groups.
3. **Future Research Directions**: This work enhances our understanding of the robustness of LLMs in clinical tasks and aims to inspire researchers to develop new methods to improve model robustness.

### Limitations

1. **Scope of Models**: The study primarily focuses on open-source LLMs, which may not fully generalize to commercial models.
2. **Sample Representativeness**: Although the researchers made efforts to recruit a diverse group of medical professionals, the final participant sample may still not be representative of all potential users of these technologies.
3. **Visibility of Results**: Participants could not see the model's output results when writing instructions, which might have affected the quality and diversity of the instructions.
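
The two kinds of numbers quoted under Main Findings, a best-to-worst AUROC range across instruction phrasings and an AUROC difference between demographic groups, could be computed roughly as follows. This is a minimal sketch, not the authors' evaluation code: the function names (`auroc`, `prompt_spread`, `subgroup_gap`), the prompt keys, the group labels, and the toy scores are all hypothetical and chosen only for illustration. It assumes binary labels and one list of model scores per instruction phrasing.

```python
from itertools import product


def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


def prompt_spread(labels, scores_by_prompt):
    """Return (best, worst, best - worst) AUROC over instruction phrasings."""
    values = [auroc(labels, s) for s in scores_by_prompt.values()]
    return max(values), min(values), max(values) - min(values)


def subgroup_gap(labels, scores, groups, group_a, group_b):
    """AUROC on group_a minus AUROC on group_b for a single phrasing."""
    def subset(name):
        idx = [i for i, g in enumerate(groups) if g == name]
        return [labels[i] for i in idx], [scores[i] for i in idx]

    return auroc(*subset(group_a)) - auroc(*subset(group_b))


# Toy data (hypothetical, not from the paper): 8 patients, 2 phrasings.
labels = [1, 0, 1, 0, 1, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
scores_by_prompt = {
    "phrasing_1": [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4],
    "phrasing_2": [0.9, 0.1, 0.8, 0.2, 0.3, 0.7, 0.6, 0.4],
}

best, worst, spread = prompt_spread(labels, scores_by_prompt)
gap = subgroup_gap(labels, scores_by_prompt["phrasing_2"], groups, "a", "b")
print(f"best={best:.4f} worst={worst:.4f} spread={spread:.4f} gap={gap:.4f}")
```

The pairwise definition of AUROC is used here only to keep the sketch dependency-free; on real data a library routine would be the more practical choice, and the spread would be taken over all collected phrasings for each model and task.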