Large Language Models Are Poor Clinical Decision-Makers: A Comprehensive Benchmark

Fenglin Liu, Zheng Li, Hongjian Zhou, Qingyu Yin, Jingfeng Yang, Xianfeng Tang, Chen Luo, Ming Zeng, Haoming Jiang, Yifan Gao, Priyanka Nigam, Sreyashi Nag, Bing Yin, Yining Hua, Xuan Zhou, Omid Rohanian, Anshul Thakur, Lei Clifton, David Clifton
DOI: https://doi.org/10.1101/2024.04.24.24306315
2024-10-16
Abstract: The adoption of large language models (LLMs) to assist clinicians has attracted remarkable attention. Existing works mainly adopt the close-ended question-answering (QA) task with answer options for evaluation. However, many clinical decisions involve answering open-ended questions without pre-set options. To better understand LLMs in the clinic, we construct a benchmark ClinicBench. We first collect eleven existing datasets covering diverse clinical language generation, understanding, and reasoning tasks. Furthermore, we construct six novel datasets and clinical tasks that are complex but common in real-world practice, e.g., open-ended decision-making, long document processing, and emerging drug analysis. We conduct an extensive evaluation of twenty-two LLMs under both zero-shot and few-shot settings. Finally, we invite medical experts to evaluate the clinical usefulness of LLMs. The benchmark data is available at https://github.com/AI-in-Health/ClinicBench.
What problem does this paper attempt to address?
This paper evaluates how well large language models (LLMs) support clinical decision-making, particularly on complex clinical tasks such as answering open-ended questions, processing long documents, and understanding new drugs. Although existing research shows that LLMs perform well on close-ended question-answering tasks, how they perform in actual clinical applications remains an open question. The paper constructs a comprehensive benchmark, ClinicBench, to evaluate 22 LLMs on 11 clinical tasks covering three main scenarios: clinical language reasoning, generation, and understanding. It also examines how different types of instruction fine-tuning data affect the performance of medical LLMs, with the aim of guiding future research and applications.
Specifically, the main contributions of the paper are:
1. Constructing ClinicBench, which contains 3 scenarios, 11 tasks, and 17 datasets, for evaluating 22 LLMs in zero-shot and few-shot settings.
2. Creating 6 new datasets that target complex problems in clinical practice, such as open-ended decision-making, long-document processing, and new-drug analysis.
3. Conducting human evaluations to measure the clinical usefulness of LLMs.
4. Exploring the use of clinical-standard knowledge bases as fine-tuning data and analyzing how different types of fine-tuning data affect the performance of medical LLMs.
Through these efforts, the paper finds that current LLMs can be comparable to human experts on structured exam-type tasks but perform poorly on actual clinical tasks such as open-ended questions, long-document processing, and understanding of new drugs. This indicates that although LLMs show potential in some respects, they still face many challenges in actual clinical applications and require further research and improvement.
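To make the evaluation settings named above concrete, the following is a minimal Python sketch of how zero-shot and few-shot prompts for close-ended and open-ended questions might be assembled. It is an illustration only, not the prompt format used by ClinicBench: the names `Example`, `format_question`, and `build_prompt`, the "Question:/Answer:" template, and the sample questions are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Example:
    """A hypothetical benchmark item (not the ClinicBench data schema)."""
    question: str
    answer: str
    options: Optional[Sequence[str]] = None  # None means the question is open-ended


def format_question(question: str, options: Optional[Sequence[str]] = None) -> str:
    """Render a question; list lettered options only for close-ended items."""
    lines = [f"Question: {question}"]
    if options:
        for label, option in zip("ABCDEFGH", options):
            lines.append(f"{label}. {option}")
    return "\n".join(lines)


def build_prompt(
    question: str,
    options: Optional[Sequence[str]] = None,
    demonstrations: Sequence[Example] = (),
) -> str:
    """Zero-shot when `demonstrations` is empty, few-shot otherwise."""
    blocks = []
    # Few-shot: prepend solved demonstrations in the same template.
    for demo in demonstrations:
        blocks.append(
            format_question(demo.question, demo.options) + f"\nAnswer: {demo.answer}"
        )
    # The target question ends with an empty answer slot for the model to fill.
    blocks.append(format_question(question, options) + "\nAnswer:")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    # Open-ended, zero-shot: no options and no demonstrations.
    print(build_prompt("What is the most appropriate next step in management?"))
    print("-----")
    # Close-ended, one-shot: options plus a single solved demonstration.
    demo = Example("Which vitamin deficiency causes scurvy?", "B",
                   options=["Vitamin A", "Vitamin C"])
    print(build_prompt("Which organ produces insulin?",
                       options=["Liver", "Pancreas"],
                       demonstrations=[demo]))
```

The only difference between the two question types in this sketch is whether answer options are shown, which mirrors the summary's point that open-ended questions remove the pre-set choices a model can otherwise select from.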