Abstract: The rapid advancements in large language models (LLMs) have presented challenges in evaluating those models. Existing evaluation methods are either reference-based or preference-based, which inevitably need human intervention or introduce test bias caused by evaluator models. In this paper, we propose GameEval, a novel approach to evaluating LLMs through goal-driven conversational games, overcoming the limitations of previous methods. GameEval treats LLMs as game players and assigns them distinct roles with specific goals achieved by launching conversations of various forms, including discussion, question answering, and voting. We design three unique games with cooperative or adversarial objectives, accompanied by corresponding evaluation metrics, to show how this new paradigm comprehensively evaluates model performance. Through extensive experiments, we show that GameEval can effectively differentiate the capabilities of various LLMs, providing a comprehensive assessment of their integrated abilities to solve complex problems. Our public anonymous code is available at https://github.com/GameEval/GameEval.

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

The paper addresses open challenges in the evaluation of large language models (LLMs). Current evaluation methods fall into two main categories: reference-based and preference-based. Reference-based methods require a standard answer, which is hard to obtain in complex scenarios or in tasks with multiple valid solutions, because high-quality annotation is costly and time-consuming. Preference-based methods can evaluate a model's performance in real-world scenarios, but they either require a large amount of human effort or introduce bias from the evaluators' preferences.

The paper proposes a new paradigm called **GameEval**, which evaluates LLMs through goal-driven dialogue games. The method treats LLMs as players and assigns them different roles and specific objectives. The games take forms such as discussion, Q&A, and voting, and three unique games with cooperative or adversarial goals are designed, each with corresponding evaluation metrics.

Unlike traditional methods, GameEval does not rely on reference answers or human preferences, so it avoids both the test bias introduced by evaluators and the dependence on standard labels. The method also requires the model to use multiple abilities at once in each round of dialogue in order to achieve long-term goals, which provides a comprehensive assessment of the model's overall capabilities.

Through extensive experiments, the paper demonstrates that GameEval can effectively distinguish the capabilities of different LLMs, especially in solving complex problems. For example, when comparing ChatGPT and GPT-4, GameEval can significantly differentiate their performance. The main contribution of the paper is a new evaluation framework that can comprehensively assess the overall capabilities of LLMs and reduce various biases in the evaluation process.
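
The idea of scoring a model by whether it reaches a game goal, rather than by comparing its output to a reference answer or asking a judge, can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the guessing game, the `Player` and `Policy` types, the scripted stand-in policies, and the win-rate metric are hypothetical and are not taken from the paper or its repository. In a real setup, each policy would wrap a call to an LLM.

```python
from dataclasses import dataclass
from typing import Callable, List

# A policy stands in for an LLM: it maps (role, transcript) to an utterance.
# Hypothetical interface; a real harness would call a model here.
Policy = Callable[[str, List[str]], str]


@dataclass
class Player:
    name: str
    role: str
    policy: Policy


def scripted_asker(role: str, transcript: List[str]) -> str:
    """Stand-in asker: walks through a fixed list of guesses, one per turn."""
    guesses = ["Is it a cat?", "Is it a dog?", "Is it a bird?"]
    turn = len(transcript) // 2  # two utterances are logged per completed turn
    return guesses[min(turn, len(guesses) - 1)]


def scripted_answerer(role: str, transcript: List[str]) -> str:
    """Stand-in answerer: always replies 'No.'"""
    return "No."


def play_guessing_game(asker: Player, answerer: Player,
                       secret: str, max_turns: int = 5) -> bool:
    """Toy cooperative game: success if the asker names the secret in time.

    The outcome is decided by the game rule alone, with no reference
    answer and no human or model judge.
    """
    transcript: List[str] = []
    for _ in range(max_turns):
        question = asker.policy(asker.role, transcript)
        transcript.append(f"{asker.name}: {question}")
        if secret.lower() in question.lower():
            return True
        reply = answerer.policy(answerer.role, transcript)
        transcript.append(f"{answerer.name}: {reply}")
    return False


def win_rate(outcomes: List[bool]) -> float:
    """Fraction of games in which the goal was reached."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


asker = Player("A", "asker", scripted_asker)
answerer = Player("B", "answerer", scripted_answerer)
outcomes = [play_guessing_game(asker, answerer, s) for s in ["cat", "dog", "fish"]]
print(f"win rate = {win_rate(outcomes):.2f}")  # prints: win rate = 0.67
```

Swapping a different policy into the same role and comparing win rates over many games is the kind of goal-based comparison the summary describes, under the simplifying assumptions stated above.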

GameEval: Evaluating LLMs on Conversational Games

Clembench: Using Game Play to Evaluate Chat-Optimized Language Models as Conversational Agents

clembench-2024: A Challenging, Dynamic, Complementary, Multilingual Benchmark and Underlying Flexible Framework for LLMs as Multi-Action Agents

Language Urban Odyssey: A Serious Game for Enhancing Second Language Acquisition Through Large Language Models

ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments

TencentLLMEval: A Hierarchical Evaluation of Real-World Capabilities for Human-Aligned LLMs

LLM-Mini-CEX: Automatic Evaluation of Large Language Model for Diagnostic Conversation

Probing the Multi-turn Planning Capabilities of LLMs via 20 Question Games

Evaluating Large Language Models with Grid-Based Game Competitions: An Extensible LLM Benchmark and Leaderboard

ESC-Eval: Evaluating Emotion Support Conversations in Large Language Models

From Text to Tactic: Evaluating LLMs Playing the Game of Avalon

L-Eval: Instituting Standardized Evaluation for Long Context Language Models

Leveraging LLMs for Dialogue Quality Measurement

Evaluating and Enhancing LLMs Agent based on Theory of Mind in Guandan: A Multi-Player Cooperative Game under Imperfect Information

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

LLMEval: A Preliminary Study on How to Evaluate Large Language Models

Spoken Language Intelligence of Large Language Models for Language Learning