Abstract: As large language models (LLMs) become increasingly powerful, traditional evaluation metrics tend to saturate, making it challenging to distinguish between models based on their performance. We propose a general method to transform existing LLM evaluations into a series of progressively more difficult tasks. These enhanced evaluations emphasize reasoning capabilities and can reveal relative performance differences that are not apparent in the original assessments. To demonstrate the effectiveness of our approach, we create a new multiple-choice test corpus, extend it into a family of evaluations, and assess a collection of LLMs. Our results offer insights into the comparative reasoning abilities of these models, particularly highlighting distinctions between OpenAI's o1-preview and Google's gemini-pro-1.5-002.

What problem does this paper attempt to address?

The paper addresses the following problem: as large language models (LLMs) improve, traditional evaluation metrics tend to saturate, which makes it difficult to tell models apart by their scores. Specifically, models already reach very high accuracy on some existing benchmarks (for example, an MMLU score of 92.3% and a HellaSwag score of 96.1%), so these benchmarks can no longer effectively separate stronger models from weaker ones. To meet this challenge, the paper proposes a general method to transform existing LLM evaluations into a series of progressively more difficult tasks. This method better reveals relative differences in reasoning ability between models, especially on high-difficulty tasks.

### Main contributions of the paper

1. **Propose "The Garbling Trick"**:
   - Randomly garble the text content and observe how different garbling rates affect model performance.
   - This method generates a score curve \( s(p) \), where \( p \) is the garbling probability, ranging from \( p = 0 \) (not garbled) to \( p = 1 \) (completely garbled).
2. **Create a new evaluation framework**:
   - Construct a new evaluation dataset named "NeoSQuAD", which contains 10,000 multiple-choice questions.
   - Expand this dataset through the garbling trick to generate a series of more challenging evaluation tasks.
3. **Analyze model performance**:
   - Evaluate nine different LLMs and generate their score curves.
   - Identify patterns in the curves. For example, at a low garbling rate, some small models perform similarly to large models, but at a high garbling rate, OpenAI's o1-preview model shows stronger reasoning ability.

### Key points of the solution

- **Impact of the garbling rate**: As the garbling rate increases, the model must reason from incomplete or uncertain information, which reveals its true reasoning ability.
- **Contextual Core**: Restricting the evaluation to questions that truly rely on context prevents performance saturation and increases the difficulty and discriminative power of the evaluation.
- **Multi-model comparison**: The different shapes of the score curves allow a clearer comparison of how models perform across difficulty levels.

In short, this paper aims to overcome the limitations of existing evaluation metrics by introducing a new evaluation method that compares the reasoning ability of large language models more comprehensively.
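
The score curve \( s(p) \) described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than the paper's actual procedure: garbling is done per character by substituting a random character from a small alphabet, only the context passage is garbled (not the question or the choices), and the names `garble`, `score_curve`, `ALPHABET`, and `toy_model` are hypothetical. `toy_model` stands in for a real LLM call.

```python
import random
import string

# Assumed replacement alphabet; the summary does not specify how text is garbled.
ALPHABET = string.ascii_letters + string.digits + " "


def garble(text, p, rng):
    """Replace each character independently with probability p by a random one."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    return "".join(
        rng.choice(ALPHABET) if rng.random() < p else ch for ch in text
    )


def score_curve(items, answer_fn, rates, seed=0):
    """Return {p: accuracy} for a multiple-choice set garbled at each rate p.

    items: list of (context, question, choices, gold_index) tuples.
    answer_fn: callable (context, question, choices) -> predicted index.
    """
    curve = {}
    for p in rates:
        rng = random.Random(seed)  # same seed per rate, for reproducibility
        correct = 0
        for context, question, choices, gold in items:
            prediction = answer_fn(garble(context, p, rng), question, choices)
            correct += int(prediction == gold)
        curve[p] = correct / len(items)
    return curve


def toy_model(context, question, choices):
    """Hypothetical stand-in for an LLM: pick the first choice found in the context."""
    for index, choice in enumerate(choices):
        if choice in context:
            return index
    return 0  # fall back to the first choice when nothing matches


items = [
    (
        "The capital of France is Paris.",
        "What is the capital of France?",
        ["Rome", "Paris", "Madrid"],
        1,
    )
]
curve = score_curve(items, toy_model, rates=[0.0, 0.25, 0.5, 0.75, 1.0])
```

With a real model plugged in as `answer_fn`, plotting `curve` for several models on the same axes would give the kind of side-by-side comparison the summary describes; at \( p = 0 \) the context passes through unchanged, so the value there is the score on the original evaluation.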

Enhancing LLM Evaluations: The Garbling Trick

Enhancing Trust in LLMs: Algorithms for Comparing and Interpreting LLMs

Towards Understanding the Robustness of LLM-based Evaluations under Perturbations

State of What Art? A Call for Multi-Prompt LLM Evaluation

Evaluating Large Language Models at Evaluating Instruction Following

Rethinking Generative Large Language Model Evaluation for Semantic Comprehension

A Closer Look into Using Large Language Models for Automatic Evaluation

Easy Problems That LLMs Get Wrong

A Comprehensive Evaluation of Large Language Models on Legal Judgment Prediction

Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences

Navigating the Labyrinth: Evaluating and Enhancing LLMs' Ability to Reason About Search Problems

Understanding the Role of LLMs in Multimodal Evaluation Benchmarks

Post Turing: Mapping the landscape of LLM Evaluation

A Survey of Useful LLM Evaluation

Competition-Level Problems are Effective LLM Evaluators

See What LLMs Cannot Answer: A Self-Challenge Framework for Uncovering LLM Weaknesses

Evaluating the Impact of Advanced LLM Techniques on AI-Lecture Tutors for a Robotics Course

Enhancing Large Language Model Performance To Answer Questions and Extract Information More Accurately

Can Large Language Models Be an Alternative to Human Evaluations?

GameEval: Evaluating LLMs on Conversational Games