E-EVAL: A Comprehensive Chinese K-12 Education Evaluation Benchmark for Large Language Models

Jinchang Hou, Chang Ao, Haihong Wu, Xiangtao Kong, Zhigang Zheng, Daijia Tang, Chengming Li, Xiping Hu, Ruifeng Xu, Shiwen Ni, Min Yang
2024-01-29
Abstract: With the accelerating development of Large Language Models (LLMs), many LLMs are beginning to be used in the Chinese K-12 education domain. Although the integration of LLMs and education is growing ever closer, there is currently no benchmark for evaluating LLMs that focuses on the Chinese K-12 education domain. There is therefore an urgent need for a comprehensive natural language processing benchmark that accurately assesses the capabilities of various LLMs in this domain. To address this, we introduce E-EVAL, the first comprehensive evaluation benchmark specifically designed for the Chinese K-12 education field. E-EVAL consists of 4,351 multiple-choice questions at the primary, middle, and high school levels across a wide range of subjects, including Chinese, English, Politics, History, Ethics, Physics, Chemistry, Mathematics, and Geography. We conducted a comprehensive evaluation of advanced LLMs on E-EVAL, including both English-dominant and Chinese-dominant models. Findings show that Chinese-dominant models perform well compared to English-dominant models, with many scoring even above GPT 4.0. However, almost all models perform poorly in complex subjects such as mathematics. We also found that most Chinese-dominant LLMs did not achieve higher scores at the primary school level than at the middle school level. We observe that a model's mastery of higher-order knowledge does not necessarily imply mastery of lower-order knowledge. Additionally, the experimental results indicate that the Chain of Thought (CoT) technique is effective only for the challenging science subjects, while few-shot prompting is more beneficial for liberal arts subjects. With E-EVAL, we aim to analyze the strengths and limitations of LLMs in educational applications, and to contribute to the progress and development of Chinese K-12 education and LLMs.
Computation and Language
What problem does this paper attempt to address?
The paper addresses the lack of a comprehensive evaluation benchmark for large language models (LLMs) in Chinese K-12 education. As LLMs have developed rapidly within natural language processing and artificial intelligence, they have begun to be applied in China's K-12 education system. However, no evaluation standard exists specifically for this field, which limits the ability of researchers and educators to accurately assess how different models perform in the Chinese K-12 education environment. The paper therefore proposes E-EVAL, the first comprehensive benchmark designed specifically to assess the capabilities of LLMs in this field. E-EVAL includes 4,351 multiple-choice questions covering primary, middle, and high school subjects: Chinese, English, politics, history, ethics, physics, chemistry, mathematics, and geography. With this benchmark, researchers can evaluate a range of LLMs on Chinese K-12 material, analyze the models' strengths and limitations, and promote the development of both K-12 education and LLMs in China.
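Because E-EVAL is made up of multiple-choice questions grouped by school level and subject, a model's results can be summarized as per-group accuracy. The following is a minimal sketch of such scoring, not the authors' evaluation code: the record fields (`level`, `subject`, `answer`, `prediction`), the function name `accuracy_by_group`, and the toy records are all assumptions made for illustration and are not taken from the benchmark.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group: accuracy} for a list of multiple-choice records.

    Each record is a dict with hypothetical fields: 'level', 'subject',
    'answer' (gold option letter) and 'prediction' (model's option letter).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        group = rec[key]
        total[group] += 1
        # Compare option letters case-insensitively, ignoring stray whitespace.
        if rec["prediction"].strip().upper() == rec["answer"].strip().upper():
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


# Toy records invented purely for illustration (not E-EVAL data).
records = [
    {"level": "primary", "subject": "mathematics", "answer": "A", "prediction": "A"},
    {"level": "primary", "subject": "chinese", "answer": "B", "prediction": "C"},
    {"level": "middle", "subject": "physics", "answer": "D", "prediction": "d"},
    {"level": "high", "subject": "history", "answer": "C", "prediction": "C"},
]

by_level = accuracy_by_group(records, "level")
by_subject = accuracy_by_group(records, "subject")
print(by_level)
print(by_subject)
```

Grouping by `level` is what would make a comparison such as primary versus middle school accuracy visible, while grouping by `subject` would expose weak areas such as mathematics.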