CliBench: A Multifaceted and Multigranular Evaluation of Large Language Models for Clinical Decision Making

Mingyu Derek Ma, Chenchen Ye, Yu Yan, Xiaoxuan Wang, Peipei Ping, Timothy S Chang, Wei Wang

2024-10-12

Abstract: The integration of Artificial Intelligence (AI), especially Large Language Models (LLMs), into the clinical diagnosis process offers significant potential to improve the efficiency and accessibility of medical care. While LLMs have shown some promise in the medical domain, their application in clinical diagnosis remains underexplored, especially in real-world clinical practice, where highly sophisticated, patient-specific decisions need to be made. Current evaluations of LLMs in this field are often narrow in scope, focusing on specific diseases or specialties and employing simplified diagnostic tasks. To bridge this gap, we introduce CliBench, a novel benchmark developed from the MIMIC IV dataset, offering a comprehensive and realistic assessment of LLMs' capabilities in clinical diagnosis. This benchmark not only covers diagnoses from a diverse range of medical cases across various specialties but also incorporates tasks of clinical significance: treatment procedure identification, lab test ordering, and medication prescription. Supported by structured output ontologies, CliBench enables a precise and multi-granular evaluation, offering an in-depth understanding of LLMs' capabilities on diverse clinical tasks at the desired granularity. We conduct a zero-shot evaluation of leading LLMs to assess their proficiency in clinical decision-making. Our preliminary results shed light on the potential and limitations of current LLMs in clinical settings, providing valuable insights for future advancements in LLM-powered healthcare.

Computation and Language, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

This paper addresses the limited evaluation of large language models (LLMs) in clinical diagnosis. Although LLMs perform well on tasks such as medical knowledge Q&A, their use in actual clinical diagnosis remains underexplored, especially in real-world practice, which requires highly complex, patient-specific decision-making. Existing evaluations are usually narrow in scope: they focus on specific diseases or specialties and adopt simplified diagnostic tasks, so they fail to reflect the complexity of real-world clinical decisions. They also tend to overlook other important clinical decisions, such as ordering laboratory tests, selecting treatment procedures, and prescribing medications. To close this gap, the paper introduces CliBench, a benchmark intended to provide a comprehensive and realistic framework for evaluating LLMs in clinical decision-making. CliBench is built from the MIMIC IV dataset and covers a wide range of cases across multiple specialties. Beyond diagnosis, it includes treatment procedure identification, laboratory test ordering, and medication prescription. Because its outputs are tied to structured ontologies, CliBench can score LLM performance on each task at several levels of granularity, providing insights for the future development of LLM-powered healthcare.
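As a rough illustration of what a multi-granular evaluation over a structured code ontology could look like, the sketch below scores a predicted set of diagnosis codes against a reference set at two levels of detail. It is not the paper's implementation: the function names `truncate` and `prf_at_level`, the use of character-prefix truncation as a stand-in for ontology levels, and the example codes are all assumptions made for illustration.

```python
from typing import Iterable, Tuple


def truncate(code: str, level: int) -> str:
    """Keep the first `level` characters of a code, ignoring the dot.

    Assumption: a shorter prefix stands in for a coarser ontology level.
    """
    return code.replace(".", "")[:level]


def prf_at_level(
    predicted: Iterable[str], gold: Iterable[str], level: int
) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1 after truncating codes to `level`."""
    pred = {truncate(c, level) for c in predicted}
    true = {truncate(c, level) for c in gold}
    tp = len(pred & true)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(true) if true else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical example: ICD-10-CM-style codes chosen only for illustration.
gold = ["I50.9", "E11.9", "N17.9"]
predicted = ["I50.2", "E11.9", "J18.9"]

for level in (3, 4):
    p, r, f = prf_at_level(predicted, gold, level)
    print(f"level={level} precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

In this toy case the prediction `I50.2` counts as correct at the coarse three-character level but wrong at the finer four-character level, so precision and recall fall from 2/3 to 1/3 as the granularity increases. Scoring at several levels separates "right disease family, wrong specific code" from a completely wrong diagnosis.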

CliMedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models in Clinical Scenarios

Large Language Models Are Poor Clinical Decision-Makers: A Comprehensive Benchmark

Large Language Models in Healthcare: A Comprehensive Benchmark

Evaluating and Mitigating Limitations of Large Language Models in Clinical Decision Making

Towards Evaluating and Building Versatile Large Language Models for Medicine

ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical Prediction?

AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator

Improving Clinical Expertise in Large Language Models Using Electronic Medical Records

Evaluation and mitigation of the limitations of large language models in clinical decision-making

Large Language Model Benchmarks in Medical Tasks

AI Hospital: Interactive Evaluation and Collaboration of LLMs As Intern Doctors for Clinical Diagnosis

CLIMB: A Benchmark of Clinical Bias in Large Language Models

AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments

MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models

CBT-Bench: Evaluating Large Language Models on Assisting Cognitive Behavior Therapy

Large language models encode clinical knowledge

Benchmarking Large Language Models in Evidence-Based Medicine

A comparison of the diagnostic ability of large language models in challenging clinical cases

Benchmarking the Confidence of Large Language Models in Clinical Questions

Evaluating large language models as agents in the clinic