Abstract: Large Language Models (LLMs) hold great promise to revolutionize current clinical systems for their superior capacities on medical text processing tasks and medical licensing exams. Meanwhile, traditional ML models such as SVM and XGBoost are still the main choice in clinical prediction tasks. An emerging question is: can LLMs beat traditional ML models in clinical prediction? Thus, we build a new benchmark, ClinicalBench, to comprehensively study the clinical predictive modeling capacities of both general-purpose and medical LLMs, and compare them with traditional ML models. ClinicalBench embraces three common clinical prediction tasks, two databases, 14 general-purpose LLMs, 8 medical LLMs, and 11 traditional ML models. Through extensive empirical investigation, we discover that both general-purpose and medical LLMs, even with different model scales and diverse prompting or fine-tuning strategies, still cannot beat traditional ML models in clinical prediction, shedding light on their potential deficiency in clinical reasoning and decision-making. We call for caution when practitioners adopt LLMs in clinical applications. ClinicalBench can be utilized to bridge the gap between LLMs' development for healthcare and real-world clinical practice.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is: can large language models (LLMs) outperform traditional machine learning (ML) models in clinical prediction tasks? Specifically, the paper constructs a new benchmark, ClinicalBench, to study comprehensively and systematically how general-purpose and medical LLMs perform on clinical prediction tasks and to compare them with traditional ML models. Through this benchmark, the authors explore the feasibility and limitations of LLMs in practical clinical prediction applications.

### Background and Objectives of the Paper

1. **Background**:
   - LLMs have performed well on medical text processing tasks and medical licensing exams, showing their potential in the medical field.
   - However, traditional ML models (such as SVM and XGBoost) are still the main choice in clinical prediction tasks.
2. **Problems**:
   - **Core Problem**: Can LLMs outperform traditional ML models in clinical prediction tasks?
   - **Specific Tasks**: The paper selects three common clinical prediction tasks: length-of-stay prediction, mortality prediction, and readmission prediction.

### Research Methods

1. **Data Sets**:
   - Two real-world clinical databases, MIMIC-III and MIMIC-IV, are used.
2. **Models**:
   - 11 traditional ML models and 22 LLMs of different scales are compared, including 14 general-purpose LLMs and 8 medical LLMs.
3. **Evaluation Metrics**:
   - Macro F1 and AUROC are adopted as evaluation metrics to account for label imbalance.

### Main Findings

1. **Direct Prompting**:
   - Directly prompted LLMs generally perform worse than traditional ML models.
   - Even when the decoding temperature is adjusted or the parameter scale is increased, LLMs still cannot outperform traditional models.
2. **Prompt Engineering**:
   - Four common prompting strategies (zero-shot chain-of-thought, self-reflection, role-playing, and in-context learning) yield only limited improvements for LLMs.
   - Only on certain tasks (such as length-of-stay prediction) does in-context learning significantly improve some LLMs, and even then they generally remain inferior to traditional models.
3. **Fine-Tuning**:
   - Fine-tuning brings clear improvements on some tasks (such as length-of-stay prediction and mortality prediction), but no improvement on readmission prediction.
   - Although fine-tuning can improve the performance of LLMs, most fine-tuned LLMs still cannot exceed typical traditional ML models.

### Conclusions

- **Main Contributions**:
  - The ClinicalBench benchmark is constructed, and for the first time the performance of LLMs and traditional ML models in clinical prediction tasks is systematically compared.
  - Even with different model scales, prompting strategies, or fine-tuning methods, LLMs currently cannot outperform traditional ML models in clinical prediction tasks.
  - The potential deficiencies of LLMs in practical clinical applications are emphasized, and caution is urged when LLMs are applied in practice.
- **Future Directions**:
  - Further research is called for on how to improve the clinical reasoning and decision-making of LLMs and so narrow the gap with traditional ML models.
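The summary above names Macro F1 and AUROC as the evaluation metrics chosen because of label imbalance. The following is a minimal sketch of both metrics in plain Python, not the benchmark's own evaluation code. The function names `macro_f1` and `auroc`, the binary 0/1 label encoding, and the toy data are assumptions made for illustration only.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def auroc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Probability that a random positive is scored above a random negative.

    Assumes binary labels encoded as 0 and 1; ties count as half a win.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy, made-up data: 10 patients, 2 positive outcomes (an imbalanced label).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
always_negative = [0] * 10
scores = [0.1, 0.2, 0.3, 0.4, 0.2, 0.1, 0.6, 0.3, 0.7, 0.5]

accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
print(accuracy)                           # 0.8, despite missing every positive
print(macro_f1(y_true, always_negative))  # about 0.444 (4/9)
print(auroc(y_true, scores))              # 0.9375
```

On this toy data a predictor that always outputs the majority class reaches 80% accuracy but a macro F1 of only about 0.44, which illustrates why an imbalance-aware metric is the more informative choice for tasks such as mortality prediction.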

ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical Prediction?

CliMedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models in Clinical Scenarios

CliBench: A Multifaceted and Multigranular Evaluation of Large Language Models for Clinical Decision Making

Improving Clinical Expertise in Large Language Models Using Electronic Medical Records

Large Language Models Are Poor Clinical Decision-Makers: A Comprehensive Benchmark

MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models

Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions

CLIMB: A Benchmark of Clinical Bias in Large Language Models

TCMBench: A Comprehensive Benchmark for Evaluating Large Language Models in Traditional Chinese Medicine

Can Large Language Models Replace Data Scientists in Clinical Research?

Benchmarking Large Language Models in Evidence-Based Medicine

Large language models encode clinical knowledge

Bias patterns in the application of LLMs for clinical decision support: A comprehensive study

Biomedical Large Languages Models Seem not to be Superior to Generalist Models on Unseen Medical Data

Large Language Models in Healthcare: A Comprehensive Benchmark

Benchmarking the Confidence of Large Language Models in Clinical Questions

ClinicalLab: Aligning Agents for Multi-Departmental Clinical Diagnostics in the Real World

Towards Evaluating and Building Versatile Large Language Models for Medicine

MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models

Is larger always better? Evaluating and prompting large language models for non-generative medical tasks

Large Language Model Benchmarks in Medical Tasks