MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models

Mianxin Liu, Jinru Ding, Jie Xu, Weiguo Hu, Xiaoyang Li, Lifeng Zhu, Zhian Bai, Xiaoming Shi, Benyou Wang, Haitao Song, Pengfei Liu, Xiaofan Zhang, Shanshan Wang, Kang Li, Haofen Wang, Tong Ruan, Xuanjing Huang, Xin Sun, Shaoting Zhang

2024-06-24

Abstract: Ensuring that medical large language models (LLMs) are generally effective and beneficial to people before real-world deployment is crucial. However, a widely accepted and accessible evaluation process for medical LLMs, especially in the Chinese context, remains to be established. In this work, we introduce "MedBench", a comprehensive, standardized, and reliable benchmarking system for Chinese medical LLMs. First, MedBench assembles the currently largest evaluation dataset (300,901 questions), covering 43 clinical specialties, and performs multi-faceted evaluation of medical LLMs. Second, MedBench provides a standardized and fully automatic cloud-based evaluation infrastructure, with physical separation of questions and ground truth. Third, MedBench implements dynamic evaluation mechanisms to prevent shortcut learning and answer memorization. Applying MedBench to popular general and medical LLMs, we observe unbiased, reproducible evaluation results that largely align with medical professionals' perspectives. This study establishes a significant foundation for preparing the practical applications of Chinese medical LLMs. MedBench is publicly accessible at https://medbench.opencompass.org.cn.

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

The paper addresses the lack of a widely recognized and easily accessible evaluation framework for medical large language models (LLMs) in the Chinese context. Existing benchmarks (such as MLEC-QA, CMExam, CBLUE, and CMB) have limitations: insufficient coverage, no standardized evaluation infrastructure, and reliability issues such as shortcut learning and answer leakage. The paper therefore proposes MedBench, a new benchmark system designed specifically to evaluate Chinese medical LLMs. Its main features are:

1. **Comprehensive evaluation dataset**: 300,901 questions covering 43 clinical specialties.
2. **Standardized cloud evaluation infrastructure**: ensures a consistent, automated evaluation process.
3. **Dynamic evaluation mechanism**: guards against shortcut learning and answer memorization, improving the reliability of evaluation results.

With these features, MedBench provides a comprehensive, standardized, and reliable evaluation system that lays the foundation for the practical application of Chinese medical LLMs.
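This summary does not say how the dynamic evaluation mechanism is implemented. As an illustration only, one common way to discourage position-based shortcuts and memorized answer letters in multiple-choice benchmarks is to shuffle the answer options on each evaluation run. The sketch below assumes a hypothetical question format (a dict with `stem`, `options`, and an integer `answer` index) and a hypothetical helper `shuffle_options`; neither is taken from MedBench.

```python
import random


def shuffle_options(question, rng):
    """Return a copy of a multiple-choice item with its options reordered.

    `question` is a hypothetical dict (not MedBench's real schema):
    {"stem": str, "options": list[str], "answer": int}
    where "answer" is the index of the correct option.
    """
    order = list(range(len(question["options"])))
    rng.shuffle(order)  # order[k] = original index of the option now at position k
    return {
        "stem": question["stem"],
        "options": [question["options"][i] for i in order],
        # Position of the originally correct option after shuffling.
        "answer": order.index(question["answer"]),
    }


# Illustrative item; the content is made up for the example.
item = {
    "stem": "Which vitamin deficiency causes scurvy?",
    "options": ["Vitamin A", "Vitamin B12", "Vitamin C", "Vitamin D"],
    "answer": 2,
}

rng = random.Random(0)  # seeded so a given run can be reproduced
variant = shuffle_options(item, rng)
```

Because the correct option is tracked by content rather than by letter, a model that has memorized "the answer is C" gains nothing, while a model that actually knows the answer is unaffected by the reordering.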

MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models

CliMedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models in Clinical Scenarios

CMB: A Comprehensive Medical Benchmark in Chinese

Towards Evaluating and Building Versatile Large Language Models for Medicine

PediaBench: A Comprehensive Chinese Pediatric Dataset for Benchmarking Large Language Models

Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese Medical Exam Dataset

CHBench: A Chinese Dataset for Evaluating Health in Large Language Models

MedSafetyBench: Evaluating and Improving the Medical Safety of Large Language Models

TCMBench: A Comprehensive Benchmark for Evaluating Large Language Models in Traditional Chinese Medicine

JMedBench: A Benchmark for Evaluating Japanese Biomedical Large Language Models

ChiMed-GPT: A Chinese Medical Large Language Model with Full Training Regime and Better Alignment to Human Preferences

PromptCBLUE: A Chinese Prompt Tuning Benchmark for the Medical Domain

CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark

CliBench: A Multifaceted and Multigranular Evaluation of Large Language Models for Clinical Decision Making

MedCalc-Bench: Evaluating Large Language Models for Medical Calculations

OpenEval: Benchmarking Chinese LLMs across Capability, Alignment and Safety

Large Language Model Benchmarks in Medical Tasks

BenchX: A Unified Benchmark Framework for Medical Vision-Language Pretraining on Chest X-Rays

CHiSafetyBench: A Chinese Hierarchical Safety Benchmark for Large Language Models

A Benchmark for Long-Form Medical Question Answering