Abstract: Recent advancements in large language models (LLMs) have transformed the field of question answering (QA). However, evaluating LLMs in the medical field is challenging due to the lack of standardized and comprehensive datasets. To address this gap, we introduce CMExam, sourced from the Chinese National Medical Licensing Examination. CMExam consists of 60K+ multiple-choice questions for standardized and objective evaluations, as well as solution explanations for model reasoning evaluation in an open-ended manner. For in-depth analyses of LLMs, we invited medical professionals to label five additional question-wise annotations, including disease groups, clinical departments, medical disciplines, areas of competency, and question difficulty levels. Alongside the dataset, we further conducted thorough experiments with representative LLMs and QA algorithms on CMExam. The results show that GPT-4 had the best accuracy of 61.6% and a weighted F1 score of 0.617. These results highlight a great disparity when compared to human accuracy, which stood at 71.6%. For explanation tasks, while LLMs could generate relevant reasoning and demonstrate improved performance after finetuning, they fall short of a desired standard, indicating ample room for improvement. To the best of our knowledge, CMExam is the first Chinese medical exam dataset to provide comprehensive medical annotations. The experiments and findings of LLM evaluation also provide valuable insights into the challenges and potential solutions in developing Chinese medical QA systems and LLM evaluation pipelines. The dataset and relevant code are available at https://github.com/williamliujl/CMExam.

CSS: A Large-scale Cross-schema Chinese Text-to-SQL Medical Dataset

Cross-Domain Learning Based Traditional Chinese Medicine Medical Record Classification.

DuSQL: A Large-Scale and Pragmatic Chinese Text-to-SQL Dataset.

Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task

CATS: A Pragmatic Chinese Answer-to-Sequence Dataset with Large Scale and High Quality

CSL: A Large-scale Chinese Scientific Literature Dataset

SeSQL: Yet Another Large-scale Session-level Chinese Text-to-SQL Dataset

MCSCSet: A Specialist-annotated Dataset for Medical-domain Chinese Spelling Correction

Huatuo-26M, a Large-scale Chinese Medical QA Dataset

TCMD: A Traditional Chinese Medicine QA Dataset for Evaluating Large Language Models

EHR-SeqSQL: A Sequential Text-to-SQL Dataset For Interactively Exploring Electronic Health Records

EHRSQL: A Practical Text-to-SQL Benchmark for Electronic Health Records

Data Augmentation with Hierarchical SQL-to-Question Generation for Cross-domain Text-to-SQL Parsing

A Dataset of Open-Domain Question Answering with Multiple-Span Answers

CHBench: A Chinese Dataset for Evaluating Health in Large Language Models

MCTS: A Multi-Reference Chinese Text Simplification Dataset

SA-SQL: A Schema-Aligned Framework for Text-to-SQL Through Large Language Models

MultiSpider: Towards Benchmarking Multilingual Text-to-SQL Semantic Parsing

Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese Medical Exam Dataset

Ar-Spider: Text-to-SQL in Arabic

BookSQL: A Large Scale Text-to-SQL Dataset for Accounting Domain