Measuring Massive Multitask Chinese Understanding

Hui Zeng
2023-05-16
Abstract: The development of large-scale Chinese language models is flourishing, yet there is a lack of corresponding capability assessments. Therefore, we propose a test to measure the multitask accuracy of large Chinese language models. This test encompasses four major domains, including medicine, law, psychology, and education, with 15 subtasks in medicine and 8 subtasks in education. We found that the best-performing models in the zero-shot setting outperformed the worst-performing models by nearly 18.6 percentage points on average. Across the four major domains, the highest average zero-shot accuracy of all models is 0.512. In the subdomains, only the GPT-3.5-turbo model achieved a zero-shot accuracy of 0.693 in clinical medicine, which was the highest accuracy among all models across all subtasks. All models performed poorly in the legal domain, with the highest zero-shot accuracy reaching only 0.239. By comprehensively evaluating the breadth and depth of knowledge across multiple disciplines, this test can more accurately identify the shortcomings of the models.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The problem this paper attempts to solve is that large-scale Chinese language models are booming, but corresponding methods for evaluating their performance are lacking. The author therefore proposes a test that measures the accuracy of large-scale Chinese language models in multitask scenarios. The test covers four major fields (medicine, law, psychology, and education) and includes multiple subtasks, so that it can comprehensively evaluate the models' understanding and problem-solving abilities across disciplines.

Specifically, this research aims to:

1. **Fill the evaluation gap**: Provide a scientific evaluation method for large-scale Chinese language models and create a high-quality Chinese evaluation dataset.
2. **Evaluate model capabilities**: Through multitask tests covering a wide range of disciplines, evaluate the performance of these models in different fields, including their accuracy in zero-shot and few-shot settings.
3. **Identify deficiencies**: By comprehensively evaluating the breadth and depth of model knowledge across multiple disciplines, identify the models' deficiencies more accurately and provide directions for future improvement.

For example, the article reports:

- In the zero-shot setting, the best-performing model scores on average nearly 18.6 percentage points higher than the worst-performing model.
- Across all subtasks, GPT-3.5-turbo achieved the highest zero-shot accuracy, 0.693, on the clinical medicine subtask.
- All models perform poorly in the legal field, where the highest zero-shot accuracy is only 0.239.

These results indicate that although large-scale models have made significant progress, they have not yet reached expert level in specific fields. This is especially true in the legal field, where model performance is close to the random level. Future research should therefore focus on improving model accuracy on vertical-domain tasks.
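The figures quoted above (per-domain accuracy, an average across domains, and a best-versus-worst gap in percentage points) can be illustrated with a minimal sketch. It is not the paper's evaluation code. The function names, the exact-match scoring rule, and the toy records are all assumptions made for illustration, and the numbers it produces are not results from the paper.

```python
from collections import defaultdict


def accuracy_by_domain(records):
    """Compute accuracy per domain.

    `records` is an iterable of (domain, predicted, gold) tuples.
    Scoring is exact match between predicted and gold answers,
    which is an assumption, not the paper's confirmed scoring rule.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for domain, predicted, gold in records:
        total[domain] += 1
        if predicted == gold:
            correct[domain] += 1
    return {domain: correct[domain] / total[domain] for domain in total}


def macro_average(per_domain):
    """Unweighted mean of the per-domain accuracies."""
    return sum(per_domain.values()) / len(per_domain)


def best_worst_gap_points(model_averages):
    """Gap between the highest and lowest average accuracy, in percentage points."""
    values = list(model_averages.values())
    return (max(values) - min(values)) * 100


# Hypothetical toy data, not taken from the paper.
toy_records = [
    ("medicine", "A", "A"),
    ("medicine", "B", "C"),
    ("law", "A", "B"),
    ("law", "D", "D"),
    ("law", "C", "A"),
    ("law", "B", "C"),
]

per_domain = accuracy_by_domain(toy_records)
print(per_domain)
print(macro_average(per_domain))
print(best_worst_gap_points({"model_a": 0.5, "model_b": 0.25}))
```

The sketch uses an unweighted mean across domains so that a domain with many questions does not dominate the average. Whether the paper weights domains this way is not stated in the text above.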