Abstract: With the accelerating development of Large Language Models (LLMs), many LLMs are beginning to be used in the Chinese K-12 education domain. Although LLMs and education are becoming increasingly integrated, there is currently no benchmark for evaluating LLMs that focuses on the Chinese K-12 education domain. There is therefore an urgent need for a comprehensive natural language processing benchmark that accurately assesses the capabilities of various LLMs in this domain. To address this, we introduce E-EVAL, the first comprehensive evaluation benchmark specifically designed for the Chinese K-12 education field. E-EVAL consists of 4,351 multiple-choice questions at the primary, middle, and high school levels across a wide range of subjects, including Chinese, English, Politics, History, Ethics, Physics, Chemistry, Mathematics, and Geography. We conducted a comprehensive evaluation of advanced LLMs on E-EVAL, including both English-dominant and Chinese-dominant models. Our findings show that Chinese-dominant models perform well compared to English-dominant models, with many scoring even higher than GPT 4.0. However, almost all models perform poorly in complex subjects such as mathematics. We also found that most Chinese-dominant LLMs did not achieve higher scores at the primary school level than at the middle school level. We observe that a model's mastery of higher-order knowledge does not necessarily imply mastery of lower-order knowledge. Additionally, the experimental results indicate that the Chain of Thought (CoT) technique is effective only for the challenging science subjects, while few-shot prompting is more beneficial for liberal arts subjects. With E-EVAL, we aim to analyze the strengths and limitations of LLMs in educational applications and to contribute to the progress of both Chinese K-12 education and LLMs.

Xiezhi: An Ever-Updating Benchmark for Holistic Domain Knowledge Evaluation

Domain Mastery Benchmark: An Ever-Updating Benchmark for Evaluating Holistic Domain Knowledge of Large Language Model--A Preliminary Release

C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models

ZhuJiu: A Multi-dimensional, Multi-faceted Chinese Benchmark for Large Language Models

LHMKE: A Large-scale Holistic Multi-subject Knowledge Evaluation Benchmark for Chinese Large Language Models

KoLA: Carefully Benchmarking World Knowledge of Large Language Models

SciKnowEval: Evaluating Multi-level Scientific Knowledge of Large Language Models

Benchmarking Foundation Models with Language-Model-as-an-Examiner

CS-Bench: A Comprehensive Benchmark for Large Language Models towards Computer Science Mastery

E-EVAL: A Comprehensive Chinese K-12 Education Evaluation Benchmark for Large Language Models

MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models

SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research

Through the Lens of Core Competency: Survey on Evaluation of Large Language Models

AGIBench: A Multi-granularity, Multimodal, Human-referenced, Auto-scoring Benchmark for Large Language Models

BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data

FoundaBench: Evaluating Chinese Fundamental Knowledge Capabilities of Large Language Models

CJEval: A Benchmark for Assessing Large Language Models Using Chinese Junior High School Exam Data

EnviroExam: Benchmarking Environmental Science Knowledge of Large Language Models

Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models

WYWEB: A NLP Evaluation Benchmark For Classical Chinese