Abstract: Large language models (LLMs) have performed remarkably well on benchmarks across various natural language processing tasks, including in the Western medical domain. However, professional evaluation benchmarks for LLMs do not yet cover the traditional Chinese medicine (TCM) domain, which has a profound history and vast influence. To address this research gap, we introduce TCM-Bench, a comprehensive benchmark for evaluating LLM performance in TCM. It comprises the TCM-ED dataset, consisting of 5,473 questions sourced from the TCM Licensing Exam (TCMLE), including 1,300 questions with authoritative analysis. The dataset covers the core components of the TCMLE, including TCM fundamentals and clinical practice. To evaluate LLMs beyond question-answering accuracy, we propose TCMScore, a metric tailored to assessing the quality of answers that LLMs generate for TCM-related questions. It jointly considers the consistency of TCM semantics and knowledge. Comprehensive experimental analyses from diverse perspectives yield the following findings: (1) The unsatisfactory performance of LLMs on this benchmark underscores their significant room for improvement in TCM. (2) Introducing domain knowledge can enhance LLMs' performance. However, for in-domain models such as ZhongJing-TCM, the quality of the generated analysis text decreases, and we hypothesize that their fine-tuning process affects basic LLM capabilities. (3) Traditional text generation metrics such as ROUGE and BERTScore are susceptible to text length and surface semantic ambiguity, while domain-specific metrics such as TCMScore can further supplement and explain their evaluation results. These findings highlight the capabilities and limitations of LLMs in TCM and aim to better support medical research.
