Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, Perry Payne, Martin Seneviratne, Paul Gamble, Chris Kelly, Abubakr Babiker, Nathanael Schärli, Aakanksha Chowdhery, Philip Mansfield, Dina Demner-Fushman, Blaise Agüera y Arcas, Dale Webster, Greg S. Corrado, Yossi Matias, Katherine Chou, Juraj Gottweis, Nenad Tomasev, Yun Liu, Alvin Rajkomar, Joelle Barral, Christopher Semturs, Alan Karthikesalingam, Vivek Natarajan

Abstract: Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question answering datasets spanning professional medicine, research and consumer queries and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate Pathways Language Model (PaLM, a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM, on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA, MedMCQA, PubMedQA and Measuring Massive Multitask Language Understanding (MMLU) clinical topics), including 67.6% accuracy on MedQA (US Medical Licensing Exam-style questions), surpassing the prior state of the art by more than 17%. However, human evaluation reveals key gaps. To resolve this, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications.

Benchmarking Automated Clinical Language Simplification: Dataset, Algorithm, and Evaluation

Multilingual Simplification of Medical Texts

CliMedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models in Clinical Scenarios

CliBench: A Multifaceted and Multigranular Evaluation of Large Language Models for Clinical Decision Making

Large Language Models Are Poor Clinical Decision-Makers: A Comprehensive Benchmark

MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models

Automated Lay Language Summarization of Biomedical Scientific Reviews

Towards Evaluating and Building Versatile Large Language Models for Medicine

Investigating Large Language Models and Control Mechanisms to Improve Text Readability of Biomedical Abstracts

Large Language Models in Healthcare: A Comprehensive Benchmark

MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models

Paragraph-level Simplification of Medical Texts

Towards more patient friendly clinical notes through language models and ontologies

CLIMB: A Benchmark of Clinical Bias in Large Language Models

Society of Medical Simplifiers

Large Language Models for Biomedical Text Simplification: Promising But Not There Yet

Better to Ask in English: Cross-Lingual Evaluation of Large Language Models for Healthcare Queries

MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records

MedCalc-Bench: Evaluating Large Language Models for Medical Calculations

AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator

Large language models encode clinical knowledge