Large Language Models Are Poor Clinical Decision-Makers: A Comprehensive Benchmark

Fenglin Liu, Zheng Li, Hongjian Zhou, Qingyu Yin, Jingfeng Yang, Xianfeng Tang, Chen Luo, Ming Zeng, Haoming Jiang, Yifan Gao, Priyanka Nigam, Sreyashi Nag, Bing Yin, Yining Hua, Xuan Zhou, Omid Rohanian, Anshul Thakur, Lei Clifton, David Clifton

DOI: https://doi.org/10.1101/2024.04.24.24306315

2024-10-16

Abstract: The adoption of large language models (LLMs) to assist clinicians has attracted remarkable attention. Existing works mainly adopt the close-ended question-answering (QA) task with answer options for evaluation. However, many clinical decisions involve answering open-ended questions without pre-set options. To better understand LLMs in the clinic, we construct a benchmark ClinicBench. We first collect eleven existing datasets covering diverse clinical language generation, understanding, and reasoning tasks. Furthermore, we construct six novel datasets and clinical tasks that are complex but common in real-world practice, e.g., open-ended decision-making, long document processing, and emerging drug analysis. We conduct an extensive evaluation of twenty-two LLMs under both zero-shot and few-shot settings. Finally, we invite medical experts to evaluate the clinical usefulness of LLMs. The benchmark data is available at https://github.com/AI-in-Health/ClinicBench.

What problem does this paper attempt to address?

This paper evaluates how well large language models (LLMs) support clinical decisions, with a focus on complex clinical tasks such as open-ended questions, long-document processing, and understanding of new drugs. Existing research shows that LLMs perform well on closed-ended question-answering tasks, but how they perform in actual clinical applications remains an open question. The paper constructs a comprehensive benchmark, ClinicBench, to evaluate 22 LLMs on 11 clinical tasks covering three main scenarios: clinical language reasoning, generation, and understanding. It also explores how different types of instruction fine-tuning data affect the performance of medical LLMs, aiming to provide guidance for future research and applications.

Specifically, the main contributions of the paper include:
1. Constructing ClinicBench, which contains 3 scenarios, 11 tasks, and 17 datasets, for evaluating 22 LLMs in zero-shot and few-shot settings.
2. Creating 6 new datasets for complex problems in clinical practice, such as open-ended decision-making, long-document processing, and new-drug analysis.
3. Conducting human evaluations to measure the usefulness of LLMs in the clinical field.
4. Exploring the use of clinical-standard knowledge bases as fine-tuning data and analyzing the impact of different fine-tuning data types on the performance of medical LLMs.

Through these efforts, the paper finds that current LLMs can be comparable to human experts on structured exam-type tasks but perform poorly on actual clinical tasks such as open-ended questions, long-document processing, and understanding of new drugs. This indicates that although LLMs show potential in some respects, they still face many challenges in actual clinical applications and require further research and improvement.
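As a rough illustration of the evaluation settings mentioned above, the following Python sketch shows how a zero-shot prompt differs from a few-shot prompt, and how a close-ended question (with answer options) differs from an open-ended one. It is not taken from the paper or the ClinicBench repository: the `Example` class, the `build_prompt` function, and the prompt layout are all hypothetical, and the question strings are placeholders rather than benchmark data.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Example:
    """Hypothetical container for one solved demonstration."""
    question: str
    answer: str
    options: Optional[Sequence[str]] = None  # None means open-ended


def format_question(question: str, options: Optional[Sequence[str]] = None) -> str:
    """Render a question; list lettered options only for close-ended QA."""
    lines = [f"Question: {question}"]
    if options:
        for label, text in zip("ABCDEFGH", options):
            lines.append(f"{label}. {text}")
    lines.append("Answer:")
    return "\n".join(lines)


def build_prompt(
    question: str,
    options: Optional[Sequence[str]] = None,
    shots: Sequence[Example] = (),
) -> str:
    """Zero-shot when `shots` is empty; few-shot when demonstrations are given."""
    parts = [
        f"{format_question(shot.question, shot.options)} {shot.answer}"
        for shot in shots
    ]
    parts.append(format_question(question, options))
    return "\n\n".join(parts)


# Zero-shot, open-ended: no options and no demonstrations.
zero_shot_open = build_prompt("<open-ended clinical question>")

# Few-shot, close-ended: one solved demonstration precedes the target question.
few_shot_closed = build_prompt(
    "<target question>",
    options=["<option 1>", "<option 2>"],
    shots=[Example("<demo question>", "B", ["<demo option 1>", "<demo option 2>"])],
)
```

In this sketch the open-ended case simply omits the option list, so the model must generate a free-text answer instead of selecting a letter, which is the distinction the summary draws between exam-style tasks and open-ended clinical decisions.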

Large Language Models in Healthcare: A Comprehensive Benchmark

CliMedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models in Clinical Scenarios

CliBench: A Multifaceted and Multigranular Evaluation of Large Language Models for Clinical Decision Making

Large language models encode clinical knowledge

Large Language Model Benchmarks in Medical Tasks

Benchmarking the Confidence of Large Language Models in Clinical Questions

Improving Clinical Expertise in Large Language Models Using Electronic Medical Records

Evaluation and mitigation of the limitations of large language models in clinical decision-making

Large language models in medical and healthcare fields: applications, advances, and challenges

Large Language Models as Agents in the Clinic

Are Large Language Models True Healthcare Jacks-of-All-Trades? Benchmarking Across Health Professions Beyond Physician Exams

Towards Evaluating and Building Versatile Large Language Models for Medicine

Is larger always better? Evaluating and prompting large language models for non-generative medical tasks

Evaluating and Mitigating Limitations of Large Language Models in Clinical Decision Making

Benchmarking Large Language Models in Evidence-Based Medicine

Evaluating large language models in medical applications: a survey

Evaluation of large language model performance on the Biomedical Language Understanding and Reasoning Benchmark

ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical Prediction?

Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions

MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models