Enhancing LLM Evaluations: The Garbling Trick

William F. Bradley
2024-11-05
Abstract: As large language models (LLMs) become increasingly powerful, traditional evaluation metrics tend to saturate, making it challenging to distinguish between models based on their performance. We propose a general method to transform existing LLM evaluations into a series of progressively more difficult tasks. These enhanced evaluations emphasize reasoning capabilities and can reveal relative performance differences that are not apparent in the original assessments. To demonstrate the effectiveness of our approach, we create a new multiple-choice test corpus, extend it into a family of evaluations, and assess a collection of LLMs. Our results offer insights into the comparative reasoning abilities of these models, particularly highlighting distinctions between OpenAI's o1-preview and Google's gemini-pro-1.5-002.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper addresses the following problem: as large language models (LLMs) improve, traditional evaluation metrics saturate, and it becomes difficult to tell models apart by their scores. Models already reach very high accuracy on some existing benchmarks (for example, an MMLU score of 92.3% and a HellaSwag score of 96.1%), so these benchmarks no longer discriminate effectively between stronger and weaker models. In response, the author proposes a general method that transforms an existing LLM evaluation into a series of progressively more difficult tasks. This makes relative differences in reasoning ability easier to see, especially on the hardest tasks.

### Main contributions of the paper

1. **Propose "The Garbling Trick"**:
   - Randomly garble the text and observe how model performance changes with the garbling rate.
   - This yields a score curve \( s(p) \), where \( p \) is the garbling probability, ranging from \( p = 0 \) (not garbled) to \( p = 1 \) (completely garbled).
2. **Create a new evaluation framework**:
   - Construct a new evaluation dataset named "NeoSQuAD", which contains 10,000 multiple-choice questions.
   - Extend this dataset with the garbling trick to generate a series of more challenging evaluation tasks.
3. **Analyze model performance**:
   - Evaluate nine different LLMs and generate their score curves.
   - Identify patterns in the curves. For example, at low garbling rates some small models perform similarly to large models, but at high garbling rates OpenAI's o1-preview model shows stronger reasoning ability.
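The garbling step in the first contribution can be illustrated with a minimal sketch. The summary above does not say at what granularity the text is garbled, so this example assumes independent per-character replacement: each character is swapped for a random one with probability \( p \). The function name `garble`, the replacement alphabet, and the use of a seeded `random.Random` are all illustrative choices, not details from the paper.

```python
import random
import string

# Assumption: characters are replaced independently with probability p.
# The replacement alphabet below is an illustrative choice.
DEFAULT_ALPHABET = string.ascii_letters + string.digits + " "


def garble(text: str, p: float, rng: random.Random,
           alphabet: str = DEFAULT_ALPHABET) -> str:
    """Return a copy of `text` in which each character is replaced,
    with probability p, by a character drawn uniformly from `alphabet`."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return "".join(
        rng.choice(alphabet) if rng.random() < p else ch
        for ch in text
    )


if __name__ == "__main__":
    passage = "The capital of France is Paris."
    for rate in (0.0, 0.3, 1.0):
        # A fresh seeded generator per rate keeps each line reproducible.
        print(rate, repr(garble(passage, rate, random.Random(0))))
```

With \( p = 0 \) the text is returned unchanged, and with \( p = 1 \) every character is redrawn, which matches the two endpoints of the score curve described above. The output always has the same length as the input.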
### Key points of the solution

- **Impact of the garbling rate**: As the garbling rate increases, the model must reason from incomplete or uncertain information, which exposes its underlying reasoning ability.
- **Contextual Core**: Restricting the evaluation to questions that truly depend on the context prevents performance saturation and makes the evaluation harder and more discriminating.
- **Multi-model comparison**: The different shapes of the score curves make it easier to compare models across tasks of varying difficulty.

In short, the paper aims to overcome the limitations of existing evaluation metrics by introducing a new evaluation method that allows a more comprehensive comparison of the reasoning abilities of large language models.
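To make the score curve and the contextual-core idea concrete, here is a small sketch under several assumptions that go beyond the summary: only the context passage is garbled (not the question or the answer choices), the score is plain accuracy, and a question counts as context-dependent if the model answers it incorrectly when given an empty context. The names `score_curve`, `contextual_core`, `toy_answer`, and the three-question dataset are hypothetical; a real evaluation would call an actual LLM in place of `toy_answer`.

```python
import random
import string
from typing import Callable, Sequence

# (context, question, choices, index of the correct choice)
Question = tuple[str, str, Sequence[str], int]
AnswerFn = Callable[[str, str, Sequence[str]], int]


def garble(text: str, p: float, rng: random.Random) -> str:
    """Assumed garbling: replace each character with probability p."""
    alphabet = string.ascii_lowercase + " "
    return "".join(rng.choice(alphabet) if rng.random() < p else ch
                   for ch in text)


def score_curve(questions: Sequence[Question], answer_fn: AnswerFn,
                rates: Sequence[float], seed: int = 0) -> dict[float, float]:
    """Accuracy s(p) at each garbling rate p, garbling only the context."""
    rng = random.Random(seed)
    curve = {}
    for p in rates:
        correct = sum(
            answer_fn(garble(ctx, p, rng), q, choices) == gold
            for ctx, q, choices, gold in questions
        )
        curve[p] = correct / len(questions)
    return curve


def contextual_core(questions: Sequence[Question],
                    answer_fn: AnswerFn) -> list[Question]:
    """One possible reading of the contextual core: keep the questions
    the model gets wrong when the context is withheld entirely."""
    return [item for item in questions
            if answer_fn("", item[1], item[2]) != item[3]]


def toy_answer(context: str, question: str, choices: Sequence[str]) -> int:
    """Stand-in for an LLM: pick the first choice found in the context,
    falling back to choice 0 when none is found."""
    for i, choice in enumerate(choices):
        if choice in context:
            return i
    return 0


QUESTIONS: list[Question] = [
    ("The capital of France is Paris.", "What is the capital of France?",
     ["Rome", "Paris", "Madrid"], 1),
    ("Water boils at 100 degrees Celsius.", "At what temperature does water boil?",
     ["90", "100", "110"], 1),
    ("The sky appears blue.", "What color is the sky?",
     ["blue", "green"], 0),
]

if __name__ == "__main__":
    core = contextual_core(QUESTIONS, toy_answer)
    print(score_curve(core, toy_answer, [0.0, 0.25, 0.5, 0.75, 1.0]))
```

In this toy setup the third question is dropped from the core because the stand-in model answers it correctly even without context, so it cannot tell us anything about how well the model copes with garbled input.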