Abstract: With the proliferation of Large Language Models (LLMs) in diverse domains, there is a particular need for unified evaluation standards in clinical medical scenarios, where models need to be examined very thoroughly. We present CliMedBench, a comprehensive benchmark with 14 expert-guided core clinical scenarios specifically designed to assess the medical ability of LLMs across 7 pivot dimensions. It comprises 33,735 questions derived from real-world medical reports of top-tier tertiary hospitals and authentic examination exercises. The reliability of this benchmark has been confirmed in several ways. Subsequent experiments with existing LLMs have led to the following findings: (i) Chinese medical LLMs underperform on this benchmark, especially where medical reasoning and factual consistency are vital, underscoring the need for advances in clinical knowledge and diagnostic accuracy. (ii) Several general-domain LLMs demonstrate substantial potential in medical clinics, while the limited input capacity of many medical LLMs hinders their practical use. These findings reveal both the strengths and limitations of LLMs in clinical scenarios and offer critical insights for medical research.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper aims to address the lack of a unified evaluation standard for large language models (LLMs) in Chinese clinical medicine scenarios. Specifically:

1. **Lack of Evaluation Standards**: Currently, despite the emergence of numerous large language models in various fields, there is a lack of a systematic and comprehensive evaluation standard to thoroughly test these models' capabilities in Chinese clinical medicine scenarios.
2. **Insufficient Existing Benchmarks**: Existing medical evaluation benchmarks are mainly based on open educational resources, which have a significant gap with actual medical practice and cannot truly reflect the complexity and challenges of clinical settings.
3. **Performance Evaluation Needs**: To effectively integrate these models into clinical practice, a standardized evaluation benchmark is needed to comprehensively assess their performance in terms of response accuracy, hallucination rate, and content safety.

### Solution

To this end, the authors propose **CliMedBench**, a large-scale Chinese benchmark that includes 14 expert-guided core clinical scenarios for evaluating the medical capabilities of large language models across 7 key dimensions. These dimensions include clinical question answering, knowledge application, reasoning, information retrieval, summarization ability, hallucination, and toxicity.

### Main Findings

1. **Poor Performance of Chinese Medical LLMs**: Particularly in tasks requiring medical reasoning and factual consistency, Chinese medical LLMs perform poorly, highlighting the need for further improvement in clinical knowledge and diagnostic accuracy.
2. **Significant Potential of General Domain LLMs**: Some general domain LLMs show significant potential in medical clinical settings, but many medical LLMs have limited input capacity, restricting their practical application.
3. **Negative Impact of Uncertainty**: Uncertainty in medical contexts can significantly affect the accuracy of model-generated responses.

Through these findings, the paper provides important insights for medical research and points out future research directions to enhance the capabilities of LLMs in clinical scenarios.
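
To make the idea of scoring a model across several evaluation dimensions concrete, here is a minimal sketch of how per-dimension accuracy could be aggregated. It is not the paper's evaluation code: the record layout `(dimension, gold_answer, model_answer)`, the function name `accuracy_by_dimension`, and the exact-match comparison are all assumptions for illustration, and free-text dimensions such as summarization would need different metrics.

```python
from collections import defaultdict

def accuracy_by_dimension(records):
    """Return {dimension: accuracy} for an iterable of
    (dimension, gold_answer, model_answer) tuples.

    Hypothetical helper: answers are compared by case-insensitive
    exact match after stripping whitespace, which only suits
    multiple-choice style questions.
    """
    totals = defaultdict(int)   # questions seen per dimension
    correct = defaultdict(int)  # exact matches per dimension
    for dimension, gold, predicted in records:
        totals[dimension] += 1
        if predicted.strip().upper() == gold.strip().upper():
            correct[dimension] += 1
    return {dim: correct[dim] / totals[dim] for dim in totals}

# Toy records, invented for illustration (not benchmark data).
sample = [
    ("reasoning", "A", "A"),
    ("reasoning", "B", "C"),
    ("hallucination", "D", "d "),
]
scores = accuracy_by_dimension(sample)
```

Reporting one score per dimension, rather than a single pooled accuracy, keeps weaknesses such as poor reasoning from being hidden behind strong results elsewhere.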

CliMedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models in Clinical Scenarios

MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models

CliBench: A Multifaceted and Multigranular Evaluation of Large Language Models for Clinical Decision Making

Large Language Models Are Poor Clinical Decision-Makers: A Comprehensive Benchmark

MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models

TCMBench: A Comprehensive Benchmark for Evaluating Large Language Models in Traditional Chinese Medicine

Improving Clinical Expertise in Large Language Models Using Electronic Medical Records

CMB: A Comprehensive Medical Benchmark in Chinese

Large Language Models in Healthcare: A Comprehensive Benchmark

Large Language Model Benchmarks in Medical Tasks

Towards Evaluating and Building Versatile Large Language Models for Medicine

ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical Prediction?

Large language models encode clinical knowledge

Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese Medical Exam Dataset

Are Large Language Models True Healthcare Jacks-of-All-Trades? Benchmarking Across Health Professions Beyond Physician Exams

MedCalc-Bench: Evaluating Large Language Models for Medical Calculations

Benchmarking Large Language Models in Evidence-Based Medicine

Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions

Benchmarking the Confidence of Large Language Models in Clinical Questions

Evaluating large language models in medical applications: a survey

Large Language Models Leverage External Knowledge to Extend Clinical Insight Beyond Language Boundaries