Abstract: The development of large-scale Chinese language models is flourishing, yet corresponding capability assessments are lacking. We therefore propose a test to measure the multitask accuracy of large Chinese language models. The test covers four major domains (medicine, law, psychology, and education), with 15 subtasks in medicine and 8 subtasks in education. In the zero-shot setting, the best-performing models outperformed the worst-performing models by nearly 18.6 percentage points on average. Across the four major domains, the highest average zero-shot accuracy among all models is 0.512. Among the subdomains, the GPT-3.5-turbo model achieved a zero-shot accuracy of 0.693 in clinical medicine, the highest accuracy among all models across all subtasks. All models performed poorly in the legal domain, with the highest zero-shot accuracy reaching only 0.239. By comprehensively evaluating the breadth and depth of knowledge across multiple disciplines, this test can more accurately identify the shortcomings of the models.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is that large-scale Chinese language models are booming, but corresponding performance evaluation methods are lacking. The paper therefore proposes a test to measure the accuracy of large-scale Chinese language models in multi-task scenarios. The test covers four major fields (medicine, law, psychology, and education) and includes multiple subtasks to comprehensively evaluate these models' understanding and problem-solving abilities in different disciplines.

Specifically, this research aims to:

1. **Fill the evaluation gap**: Provide a scientific evaluation method for large-scale Chinese language models and create a high-quality Chinese evaluation data set.
2. **Evaluate model capabilities**: Through multi-task tests covering a wide range of disciplines, evaluate the performance of these models in different fields, including accuracy in zero-shot and few-shot settings.
3. **Identify deficiencies**: By comprehensively evaluating the breadth and depth of models in multiple disciplines, more accurately identify the deficiencies of models and provide directions for future improvements.

For example, the article reports:

- In the zero-shot setting, the best-performing model is on average nearly 18.6 percentage points higher than the worst-performing model.
- Among all subtasks, GPT-3.5-turbo achieved the highest zero-shot accuracy of 0.693, in the clinical medicine subtask.
- All models perform poorly in the legal field, with the highest zero-shot accuracy being only 0.239.

These results indicate that although large-scale models have made significant progress, they still have not reached the expert level in specific fields, especially in the legal field, where the performance of models is close to the random level. Therefore, future research should focus on how to improve the accuracy of models in vertical-domain tasks.
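As a rough illustration of the kind of figures quoted above, the following minimal sketch computes per-domain accuracy from multiple-choice predictions and then averages the best-minus-worst gap across domains. It is not the paper's evaluation code: the function names, the record layout, the `model_a`/`model_b` labels, and the toy answers are all hypothetical, and averaging the gap per domain is only one plausible reading of "on average".

```python
from collections import defaultdict


def accuracy_by_domain(records):
    """Return {domain: accuracy} for (domain, predicted, gold) records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for domain, predicted, gold in records:
        total[domain] += 1
        correct[domain] += int(predicted == gold)
    return {domain: correct[domain] / total[domain] for domain in total}


def mean_best_worst_gap(per_model):
    """Average over shared domains of (best - worst) accuracy, in percentage points.

    per_model maps a model name to its {domain: accuracy} dictionary.
    """
    shared = set.intersection(*(set(scores) for scores in per_model.values()))
    gaps = []
    for domain in shared:
        values = [scores[domain] for scores in per_model.values()]
        gaps.append(max(values) - min(values))
    return 100 * sum(gaps) / len(gaps)


# Toy, made-up answers for two hypothetical models (not data from the paper).
records_a = [
    ("medicine", "A", "A"),
    ("medicine", "B", "C"),
    ("law", "D", "D"),
    ("law", "A", "B"),
]
records_b = [
    ("medicine", "A", "A"),
    ("medicine", "C", "C"),
    ("law", "B", "D"),
    ("law", "A", "B"),
]

per_model = {
    "model_a": accuracy_by_domain(records_a),
    "model_b": accuracy_by_domain(records_b),
}
gap_points = mean_best_worst_gap(per_model)
print(per_model)
print(f"mean best-vs-worst gap: {gap_points:.1f} percentage points")
```

Keeping accuracy as a fraction (as in the 0.693 and 0.239 figures) and converting to percentage points only when reporting a gap avoids mixing the two scales.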

Measuring Massive Multitask Chinese Understanding
