ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical Prediction?

Canyu Chen, Jian Yu, Shan Chen, Che Liu, Zhongwei Wan, Danielle Bitterman, Fei Wang, Kai Shu
2024-11-10
Abstract: Large Language Models (LLMs) hold great promise to revolutionize current clinical systems for their superior capacities on medical text processing tasks and medical licensing exams. Meanwhile, traditional ML models such as SVM and XGBoost have still been mainly adopted in clinical prediction tasks. An emerging question is Can LLMs beat traditional ML models in clinical prediction? Thus, we build a new benchmark ClinicalBench to comprehensively study the clinical predictive modeling capacities of both general-purpose and medical LLMs, and compare them with traditional ML models. ClinicalBench embraces three common clinical prediction tasks, two databases, 14 general-purpose LLMs, 8 medical LLMs, and 11 traditional ML models. Through extensive empirical investigation, we discover that both general-purpose and medical LLMs, even with different model scales, diverse prompting or fine-tuning strategies, still cannot beat traditional ML models in clinical prediction yet, shedding light on their potential deficiency in clinical reasoning and decision-making. We call for caution when practitioners adopt LLMs in clinical applications. ClinicalBench can be utilized to bridge the gap between LLMs' development for healthcare and real-world clinical practice.
Computation and Language
What problem does this paper attempt to address?
The question this paper addresses is whether large language models (LLMs) can outperform traditional machine learning (ML) models in clinical prediction tasks. To answer it, the authors build a new benchmark, ClinicalBench, that systematically evaluates general-purpose and medical LLMs on clinical prediction tasks and compares them with traditional ML models. The goal is to establish both the feasibility and the limitations of LLMs in practical clinical prediction.

### Background and Objectives of the Paper

1. **Background**:
   - LLMs perform well on medical text processing tasks and medical licensing exams, which suggests considerable potential in the medical field.
   - However, traditional ML models (such as SVM and XGBoost) remain the main choice for clinical prediction tasks.
2. **Problems**:
   - **Core Problem**: Can LLMs outperform traditional ML models in clinical prediction tasks?
   - **Specific Tasks**: The paper selects three common clinical prediction tasks: length-of-stay prediction, mortality prediction, and readmission prediction.

### Research Methods

1. **Data Sets**:
   - Two real-world clinical databases, MIMIC-III and MIMIC-IV, are used.
2. **Models**:
   - 11 traditional ML models are compared with 22 LLMs of different scales: 14 general-purpose LLMs and 8 medical LLMs.
3. **Evaluation Metrics**:
   - Macro F1 and AUROC are used as evaluation metrics to account for label imbalance.

### Main Findings

1. **Direct Prompting**:
   - Directly prompted LLMs generally perform worse than traditional ML models.
   - Adjusting the decoding temperature or increasing the parameter scale does not let LLMs outperform traditional models.
2. **Prompt Engineering**:
   - Four common prompting strategies (zero-shot chain-of-thought, self-reflection, role-playing, and in-context learning) yield only limited improvements.
   - In-context learning significantly improves some LLMs on specific tasks (such as length-of-stay prediction), but the results are still generally worse than those of traditional models.
3. **Fine-Tuning**:
   - Fine-tuning brings clear improvements on some tasks (such as length-of-stay prediction and mortality prediction), but none on readmission prediction.
   - Although fine-tuning can improve LLM performance, most fine-tuned LLMs still do not exceed typical traditional ML models.

### Conclusions

- **Main Contributions**:
  - The paper constructs the ClinicalBench benchmark and presents what it describes as the first systematic comparison of LLMs and traditional ML models on clinical prediction tasks.
  - It finds that LLMs currently cannot outperform traditional ML models in clinical prediction, regardless of model scale, prompting strategy, or fine-tuning method.
  - It highlights the potential deficiencies of LLMs in practical clinical applications and calls for caution when adopting them.
- **Future Directions**:
  - The authors call for further research on improving LLMs' clinical reasoning and decision-making to narrow the gap with traditional ML models.
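
As a rough illustration of why Macro F1 and AUROC are suited to imbalanced labels, the sketch below computes both from scratch on toy data. The function names (`macro_f1`, `auroc`) and the toy labels and scores are assumptions made for this example and are not taken from the paper or its code; the AUROC helper covers only the binary case.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of per-class F1, so a rare class counts as much as a common one."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def auroc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Binary AUROC: chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 8 negatives and 2 positives (an imbalanced binary task).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
always_negative = [0] * len(y_true)  # a trivial "predict the majority class" baseline
scores = [0.1, 0.2, 0.3, 0.1, 0.4, 0.2, 0.6, 0.3, 0.7, 0.5]

accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
print(f"accuracy={accuracy:.3f}")
print(f"macro_f1={macro_f1(y_true, always_negative):.3f}")
print(f"auroc={auroc(y_true, scores):.4f}")
```

On this toy data the majority-class baseline reaches 80% accuracy but a Macro F1 below 0.5, because its F1 on the positive class is zero. This is the kind of failure that plain accuracy hides on imbalanced clinical labels.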