Abstract: Large Language Models (LLMs) are commonly evaluated using human-crafted benchmarks, under the premise that higher scores implicitly reflect stronger human-like performance. However, there is growing concern that LLMs may "game" these benchmarks due to data leakage, achieving high scores while struggling with tasks that are simple for humans. To substantively address the problem, we create GAOKAO-Eval, a comprehensive benchmark based on China's National College Entrance Examination (Gaokao), and conduct "closed-book" evaluations for representative models released prior to Gaokao. Contrary to prevailing consensus, even after addressing data leakage and comprehensiveness, GAOKAO-Eval reveals that high scores still fail to truly reflect human-aligned capabilities. To better understand this mismatch, we introduce the Rasch model from cognitive psychology to analyze LLM scoring patterns and identify two key discrepancies: 1) anomalously consistent performance across various question difficulties, and 2) high variance in performance on questions of similar difficulty. In addition, we identify inconsistent grading of LLM-generated answers among teachers, as well as recurring mistake patterns. We find that these phenomena are well grounded in the motivations behind OpenAI o1, and that o1's reasoning-as-difficulties can mitigate the mismatch. These results show that GAOKAO-Eval can reveal limitations in LLM capabilities not captured by current benchmarks and highlight the need for more LLM-aligned difficulty analysis.

What problem does this paper attempt to address?

This paper examines whether the high benchmark scores of large language models (LLMs) truly reflect human-aligned capabilities. Specifically, it addresses the following key issues:

1. **Validity of evaluation criteria**:
   - LLM evaluation currently relies mainly on human-designed benchmarks, which assume that higher scores mean stronger human-like performance. However, there is growing concern that LLMs may exploit data leakage in these benchmarks to obtain high scores while performing unreliably on tasks that are simple for humans.
   - The paper addresses this by introducing GAOKAO-Eval, a comprehensive benchmark based on China's National College Entrance Examination (Gaokao).
2. **Mismatch between high scores and actual capabilities**:
   - Even after addressing data leakage and insufficient benchmark coverage, GAOKAO-Eval still reveals that high scores do not truly reflect human-aligned capabilities. For example, LLMs may perform well on complex problems yet often struggle with simple ones.
   - The paper analyzes LLM scoring patterns using the Rasch model from cognitive psychology and identifies two key discrepancies: (1) anomalously consistent performance across questions of different difficulty; (2) high variance in performance on questions of similar difficulty.
3. **Grading inconsistency and error patterns**:
   - The paper finds that teachers grade LLM-generated answers inconsistently and that LLMs show recurring mistake patterns across different types of tasks. These phenomena indicate that a high score does not guarantee human-aligned capabilities.
4. **Improving evaluation methods**:
   - The paper proposes using reasoning difficulty as a proxy when assessing difficulty for LLMs, in order to mitigate the mismatches above. In addition, the design of GAOKAO-Eval aims to keep the evaluation process secure and transparent, through a closed-book setting, strict temporal isolation (only models released before the exam are evaluated), and grading of subjective questions by experienced Gaokao examiners.

### Summary

By constructing GAOKAO-Eval, a comprehensive and secure benchmark, this paper reveals the limitations of current LLM evaluation methods and emphasizes the importance of developing evaluation methods better suited to the characteristics of LLMs. The results show that high scores do not necessarily reflect the real capabilities of LLMs, especially on tasks designed to measure human abilities.
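As a rough illustration of the Rasch model referred to above: in its standard one-parameter form, the probability of a correct answer depends only on the gap between a test-taker's ability and the question's difficulty, P(correct) = 1 / (1 + exp(-(ability - difficulty))). The sketch below is not the paper's fitting procedure; the function names and the difficulty values are assumptions chosen purely for illustration.

```python
import math


def rasch_prob(ability: float, difficulty: float) -> float:
    """Probability of a correct answer under the standard Rasch (1PL) model.

    Both arguments are on the same logit scale.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def expected_curve(ability: float, difficulties: list[float]) -> list[float]:
    """Expected success probability on each item for one test-taker."""
    return [rasch_prob(ability, b) for b in difficulties]


# Hypothetical item difficulties (illustrative values, not from the paper).
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]

# For a test-taker of average ability, the model predicts that success
# probability falls steadily as questions get harder.
curve = expected_curve(0.0, difficulties)
for b, p in zip(difficulties, curve):
    print(f"difficulty={b:+.1f}  P(correct)={p:.3f}")
```

Under this model, a human-like score profile declines as difficulty rises. A profile that stays roughly flat across difficulties, or that varies widely among items of equal difficulty, departs from the model's prediction, which is the kind of mismatch the two discrepancies describe.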

GAOKAO-Eval: Does high scores truly reflect strong capabilities in LLMs?
