GameEval: Evaluating LLMs on Conversational Games

Dan Qiao, Chenfei Wu, Yaobo Liang, Juntao Li, Nan Duan
2023-08-19
Abstract: The rapid advancements in large language models (LLMs) have presented challenges in evaluating those models. Existing evaluation methods are either reference-based or preference-based, which inevitably need human intervention or introduce test bias caused by evaluator models. In this paper, we propose GameEval, a novel approach to evaluating LLMs through goal-driven conversational games, overcoming the limitations of previous methods. GameEval treats LLMs as game players and assigns them distinct roles with specific goals achieved by launching conversations of various forms, including discussion, question answering, and voting. We design three unique games with cooperative or adversarial objectives, accompanied by corresponding evaluation metrics, to show how this new paradigm comprehensively evaluates model performance. Through extensive experiments, we show that GameEval can effectively differentiate the capabilities of various LLMs, providing a comprehensive assessment of their integrated abilities to solve complex problems. Our public anonymous code is available at https://github.com/GameEval/GameEval.
Computation and Language
What problem does this paper attempt to address?
### The Problem the Paper Attempts to Solve

The paper aims to address existing challenges in the evaluation of large language models (LLMs). Current evaluation methods fall mainly into two categories: reference-based methods and preference-based methods. Reference-based methods require a standard answer, which is difficult to obtain in complex scenarios or tasks with multiple solutions, as high-quality annotation is costly and time-consuming. Preference-based methods can evaluate a model's performance in real-world scenarios, but they either require a large amount of human effort or introduce bias from the evaluators' preferences.

The paper proposes a new paradigm called **GameEval** to evaluate LLMs through goal-driven dialogue games. This method treats LLMs as players and assigns them different roles and specific objectives. The games include forms such as discussions, Q&A, and voting, and three unique games with cooperative or adversarial goals are designed, along with corresponding evaluation metrics.

Unlike traditional methods, GameEval does not rely on reference answers or human preferences, thus eliminating testing bias and dependence on standard labels. Moreover, this method requires the model to simultaneously utilize multiple abilities in each round of dialogue to achieve long-term goals, providing a comprehensive assessment of the model's overall capabilities.

Through extensive experiments, the paper demonstrates that GameEval can effectively distinguish the capabilities of different LLMs, especially in solving complex problems. For example, when comparing ChatGPT and GPT-4, GameEval can significantly differentiate their performance. The main contribution of the paper is the proposal of a new evaluation framework that can comprehensively assess the overall capabilities of LLMs and reduce various biases in the evaluation process.
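To make the idea of outcome-based scoring concrete, here is a minimal sketch of a goal-driven conversational game. It is not the paper's implementation and not one of its three games: the guessing game, the names `Player`, `GameResult`, `play_guessing_game`, and `success_rate`, and the scripted players are all assumptions made for illustration. In a real setup each player function would presumably wrap a call to the LLM under evaluation.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: a player maps the dialogue history to its next
# utterance. A real evaluation would wrap an LLM call here.
Player = Callable[[List[str]], str]


@dataclass
class GameResult:
    success: bool  # did the questioner reach its goal?
    rounds: int    # number of rounds played


def play_guessing_game(secret: str, answerer: Player, questioner: Player,
                       max_rounds: int = 5) -> GameResult:
    """Toy cooperative Q&A game: the questioner must name the secret word."""
    history: List[str] = []
    for round_no in range(1, max_rounds + 1):
        question = questioner(history)
        history.append(f"Q: {question}")
        # The outcome is decided by the game rules alone: no reference
        # answer and no human or model judge is consulted.
        if secret.lower() in question.lower():
            return GameResult(success=True, rounds=round_no)
        history.append(f"A: {answerer(history)}")
    return GameResult(success=False, rounds=max_rounds)


def success_rate(results: List[GameResult]) -> float:
    """Fraction of games in which the goal was reached."""
    return sum(r.success for r in results) / len(results)


# Scripted stand-ins for LLM players (assumptions for this sketch only).
def scripted_questioner(history: List[str]) -> str:
    script = ["Is it an animal?", "Does it purr?", "Is it a cat?"]
    turn = sum(1 for line in history if line.startswith("Q: "))
    return script[min(turn, len(script) - 1)]


def scripted_answerer(history: List[str]) -> str:
    return "Yes"


results = [
    play_guessing_game("cat", scripted_answerer, scripted_questioner),
    play_guessing_game("dog", scripted_answerer, scripted_questioner),
]
print(success_rate(results))
```

The point of the sketch is that the metric (success rate, rounds needed) falls out of the game rules, so two models can be compared by swapping in different player functions and replaying the same games.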