clembench-2024: A Challenging, Dynamic, Complementary, Multilingual Benchmark and Underlying Flexible Framework for LLMs as Multi-Action Agents

Anne Beyer, Kranti Chalamalasetti, Sherzod Hakimov, Brielen Madureira, Philipp Sadler, David Schlangen

2024-05-31

Abstract: It has been established in recent work that Large Language Models (LLMs) can be prompted to "self-play" conversational games that probe certain capabilities (general instruction following, strategic goal orientation, language understanding abilities), where the resulting interactive game play can be automatically scored. In this paper, we take one of the proposed frameworks for setting up such game-play environments, and further test its usefulness as an evaluation instrument, along a number of dimensions: We show that it can easily keep up with new developments while avoiding data contamination, we show that the tests implemented within it are not yet saturated (human performance is substantially higher than that of even the best models), and we show that it lends itself to investigating additional questions, such as the impact of the prompting language on performance. We believe that the approach forms a good basis for making decisions on model choice for building applied interactive systems, and perhaps ultimately setting up a closed-loop development environment of system and simulated evaluator.

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

This paper examines how self-play conversational games can be used to evaluate the capabilities of large language models (LLMs), including across languages. The authors extend the clemgame framework into a dynamic, challenging, and complementary benchmark called clembench, which tests the performance of LLMs as multi-action agents. The study finds that the benchmark is not yet saturated: even the best models perform substantially below humans. It also shows that the benchmark can keep up with new developments while avoiding data contamination. The flexibility of the framework makes it easy to integrate new models and to track the performance improvements of open-weight models. The paper also points out that clembench, as an evaluation tool, is suitable not only for selecting models to build interactive systems but also for in-depth research into specific aspects of LLM behavior, such as the impact of the prompting language on performance. Finally, the authors suggest potential future applications, such as interactive learning environments and closed-loop development frameworks, to support dialogue system design.
