Abstract: The emergence of Large Language Models (LLMs) in the medical domain has created a pressing need for standard datasets to evaluate their question-answering (QA) performance. Although several benchmark datasets for medical QA exist, they either cover common knowledge across different departments or are specific to departments other than pediatrics. Moreover, some of them are limited to objective questions and do not measure the generation capacity of LLMs. Therefore, they cannot comprehensively assess the QA ability of LLMs in pediatrics. To fill this gap, we construct PediaBench, the first Chinese pediatric dataset for LLM evaluation. Specifically, it contains 4,565 objective questions and 1,632 subjective questions spanning 12 pediatric disease groups. It adopts an integrated scoring criterion based on different difficulty levels to thoroughly assess the proficiency of an LLM in instruction following, knowledge understanding, clinical case analysis, etc. Finally, we validate the effectiveness of PediaBench with extensive experiments on 20 open-source and commercial LLMs. Through an in-depth analysis of experimental results, we offer insights into the ability of LLMs to answer pediatric questions in the Chinese context, highlighting their limitations for further improvements. Our code and data are published at https://github.com/ACMISLab/PediaBench.

What problem does this paper attempt to address?

The paper addresses the lack of a standard dataset for evaluating the question-answering ability of large language models (LLMs) in pediatrics. Existing medical question-answering benchmarks either cover general medical knowledge across departments or target departments other than pediatrics. In addition, some of these datasets are limited to objective questions, so they cannot measure how well LLMs generate medical text. As a result, existing benchmarks cannot fully evaluate the performance of LLMs in pediatric question answering.

To solve this problem, the paper introduces PediaBench, the first large-scale question-answering dataset specifically for Chinese pediatrics. PediaBench contains 4,565 objective questions and 1,632 subjective questions, covering 12 typical pediatric disease groups. By combining diverse question types with an integrated scoring criterion, PediaBench evaluates the abilities of LLMs in instruction following, knowledge understanding, clinical case analysis, etc.

The main contributions of PediaBench are:

1. **Constructing a high-quality pediatric question-answering dataset**: PediaBench is the first large-scale question-answering dataset specifically for Chinese pediatrics, covering multiple question types and a wide range of pediatric diseases.
2. **Designing a comprehensive scoring scheme**: To evaluate the performance of LLMs accurately, PediaBench adopts a scoring scheme that combines a difficulty coefficient with automatic scoring.
3. **Extensive experimental verification**: Experiments on 20 open-source and commercial LLMs verify the effectiveness of PediaBench and provide a detailed performance analysis, highlighting the limitations of current LLMs and directions for improvement.

Through these measures, PediaBench fills the gap left by existing benchmark datasets in pediatrics and provides a tool for evaluating and improving the performance of LLMs on pediatric question-answering tasks.
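The summary above says only that the scoring scheme combines a difficulty coefficient with automatic scoring; it does not give the formula. As a rough illustration of the general idea, and not the authors' actual method, the sketch below computes a difficulty-weighted average of per-question scores. The class `GradedQuestion`, the function `difficulty_weighted_score`, the 0 to 1 score range, and the 0 to 100 scale are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class GradedQuestion:
    """One graded benchmark item (hypothetical structure, not from the paper)."""
    question_id: str
    difficulty: float  # assumed difficulty coefficient, must be >= 0
    score: float       # assumed automatic score, normalized to [0, 1]


def difficulty_weighted_score(items: Iterable[GradedQuestion]) -> float:
    """Return a 0-100 score in which harder questions carry more weight.

    This is a plain weighted average; the real PediaBench criterion may differ.
    """
    items = list(items)
    total_weight = sum(q.difficulty for q in items)
    if total_weight <= 0:
        raise ValueError("at least one question needs a positive difficulty")
    weighted_sum = sum(q.difficulty * q.score for q in items)
    return 100.0 * weighted_sum / total_weight


if __name__ == "__main__":
    graded = [
        GradedQuestion("q1", difficulty=1.0, score=1.0),  # easy, fully correct
        GradedQuestion("q2", difficulty=2.0, score=0.5),  # medium, half credit
        GradedQuestion("q3", difficulty=3.0, score=0.0),  # hard, incorrect
    ]
    print(round(difficulty_weighted_score(graded), 2))
```

Under this weighting, missing a hard question costs more than missing an easy one, which is one way a single number can reflect performance across difficulty levels.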

PediaBench: A Comprehensive Chinese Pediatric Dataset for Benchmarking Large Language Models
