PediaBench: A Comprehensive Chinese Pediatric Dataset for Benchmarking Large Language Models

Qian Zhang, Panfeng Chen, Jiali Li, Linkun Feng, Shuyu Liu, Mei Chen, Hui Li, Yanhao Wang
2024-12-09
Abstract: The emergence of Large Language Models (LLMs) in the medical domain has stressed a compelling need for standard datasets to evaluate their question-answering (QA) performance. Although there have been several benchmark datasets for medical QA, they either cover common knowledge across different departments or are specific to another department rather than pediatrics. Moreover, some of them are limited to objective questions and do not measure the generation capacity of LLMs. Therefore, they cannot comprehensively assess the QA ability of LLMs in pediatrics. To fill this gap, we construct PediaBench, the first Chinese pediatric dataset for LLM evaluation. Specifically, it contains 4,565 objective questions and 1,632 subjective questions spanning 12 pediatric disease groups. It adopts an integrated scoring criterion based on different difficulty levels to thoroughly assess the proficiency of an LLM in instruction following, knowledge understanding, clinical case analysis, etc. Finally, we validate the effectiveness of PediaBench with extensive experiments on 20 open-source and commercial LLMs. Through an in-depth analysis of experimental results, we offer insights into the ability of LLMs to answer pediatric questions in the Chinese context, highlighting their limitations for further improvements. Our code and data are published at https://github.com/ACMISLab/PediaBench.
Computation and Language
What problem does this paper attempt to address?
The paper addresses the lack of a standard dataset for evaluating the question-answering ability of large language models (LLMs) in pediatrics. Existing medical question-answering benchmarks either cover general medical knowledge across departments or target a department other than pediatrics. Some of them are also limited to objective questions, so they cannot measure how well an LLM generates medical text. As a result, existing benchmarks cannot fully evaluate LLM performance on pediatric question answering.

To solve this problem, the paper introduces PediaBench, the first large-scale question-answering dataset specifically for Chinese pediatrics. PediaBench contains 4,565 objective questions and 1,632 subjective questions covering 12 typical pediatric disease groups. With diverse question types and an integrated scoring criterion, it evaluates LLMs on instruction following, knowledge understanding, clinical case analysis, and related abilities. The main contributions of PediaBench are:

1. **Constructing a high-quality pediatric question-answering dataset**: PediaBench is the first large-scale question-answering dataset specifically for Chinese pediatrics, covering multiple question types and a wide range of pediatric diseases.
2. **Designing a comprehensive scoring scheme**: To evaluate LLM performance accurately, PediaBench adopts a scoring scheme that combines a difficulty coefficient with automatic scoring.
3. **Extensive experimental verification**: Experiments on 20 open-source and commercial LLMs verify the effectiveness of PediaBench and provide a detailed performance analysis, highlighting the limitations of current LLMs and directions for improvement.

In this way, PediaBench fills the pediatric gap in existing benchmark datasets and provides a tool for evaluating and improving LLM performance on pediatric question-answering tasks.
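
This summary does not give the exact formula for combining a difficulty coefficient with automatic scoring. As a rough illustration only, the Python sketch below assumes a simple weighted average in which each question's automatically assigned credit is multiplied by a difficulty weight. The names `GradedQuestion`, `difficulty_weight`, `earned`, and `weighted_score`, as well as the weight values, are hypothetical and are not taken from the PediaBench code.

```python
from dataclasses import dataclass


@dataclass
class GradedQuestion:
    difficulty_weight: float  # hypothetical coefficient; larger means harder
    earned: float             # credit from an automatic grader, in [0, 1]


def weighted_score(items: list[GradedQuestion]) -> float:
    """Return a 0-100 score in which harder questions count for more.

    Assumed formula (not from the paper):
        100 * sum(w_i * earned_i) / sum(w_i)
    """
    total_weight = sum(q.difficulty_weight for q in items)
    if total_weight == 0:
        # No questions (or all weights zero): define the score as 0.
        return 0.0
    earned = sum(q.difficulty_weight * q.earned for q in items)
    return 100.0 * earned / total_weight


# Illustrative data only; these are not PediaBench results.
results = [
    GradedQuestion(difficulty_weight=1.0, earned=1.0),  # easy, fully correct
    GradedQuestion(difficulty_weight=2.0, earned=0.0),  # medium, incorrect
    GradedQuestion(difficulty_weight=3.0, earned=0.5),  # hard, partial credit
]
print(weighted_score(results))
```

Under this assumed scheme, missing a hard question costs more than missing an easy one, which is one plausible way a difficulty coefficient could enter an integrated score.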