BattleAgentBench: A Benchmark for Evaluating Cooperation and Competition Capabilities of Language Models in Multi-Agent Systems

Wei Wang, Dan Zhang, Tao Feng, Boyan Wang, Jie Tang
2024-08-29
Abstract: Large Language Models (LLMs) are becoming increasingly powerful and capable of handling complex tasks, e.g., building single agents and multi-agent systems. Compared to single agents, multi-agent systems have higher requirements for the collaboration capabilities of language models. Many benchmarks are proposed to evaluate their collaborative abilities. However, these benchmarks lack fine-grained evaluations of LLM collaborative capabilities. Additionally, multi-agent collaborative and competitive scenarios are ignored in existing works. To address these two problems, we propose a benchmark, called BattleAgentBench, which defines seven sub-stages of three varying difficulty levels and conducts a fine-grained evaluation of language models in terms of single-agent scenario navigation capabilities, paired-agent task execution abilities, and multi-agent collaboration and competition capabilities. We conducted extensive evaluations on four leading closed-source and seven open-source models. Experimental results indicate that API-based models perform excellently on simple tasks but open-source small models struggle with simple tasks. Regarding difficult tasks that require collaborative and competitive abilities, although API-based models have demonstrated some collaborative capabilities, there is still enormous room for improvement.
Computation and Language
What problem does this paper attempt to address?
The main problem this paper addresses is that existing benchmarks lack fine-grained criteria for assessing the collaboration and competition capabilities of large language models (LLMs). Specifically:

1. **Limitations of Existing Benchmarks**:
   - Existing benchmarks evaluate the collaboration capabilities of language models in multi-agent systems only coarsely, without fine-grained evaluation.
   - Scenarios that combine collaboration and competition among multiple agents are ignored in existing work.
2. **The Proposed New Benchmark**:
   - To address these problems, the authors propose a new benchmark, **BattleAgentBench**. It examines the navigation, collaboration, and competition capabilities of language models in single-agent, two-agent, and multi-agent environments through seven sub-stages spanning three difficulty levels.
3. **Evaluation Framework**:
   - **Level 1**: Basic agent capability evaluation, mainly testing the basic capabilities of a single agent and its navigation ability in simple scenarios.
   - **Level 2**: Two-agent interaction evaluation, testing the collaboration and competition capabilities between two agents.
   - **Level 3**: Multi-agent dynamic evaluation, testing the collaboration and competition capabilities of multiple agents in complex scenarios.
4. **Experimental Results**:
   - In extensive evaluations of four closed-source and seven open-source models, API-based models perform well on simple tasks but still have substantial room for improvement on complex tasks that require collaboration and competition capabilities. Small open-source models perform poorly on simple tasks and even worse on complex tasks.
### Formula Summary

- **Forward Distance (F Dis)**:
  \[ F\,Dis = L_1(p_s - p_{target}) - L_1(p_e - p_{target}) \]
  where \(L_1\) represents the L1 distance, \(p_s\) is the initial position of the tank, \(p_{target}\) is the target position, and \(p_e\) is the position of the tank at the end of the game. A positive value means the tank ended closer to the target than it started.
- **Format Accuracy (F Acc)**:
  \[ F\,Acc = \frac{N_{format}}{N_{total}} \]
  where \(N_{format}\) represents the number of rounds with correct output format, and \(N_{total}\) represents the total number of rounds.
- **Movement Accuracy (M Acc)**:
  \[ M\,Acc = \frac{N_{correct}}{N_{format}} \]
  where \(N_{correct}\) represents the number of rounds with correct movement direction. Because the denominator is \(N_{format}\), movement accuracy is measured only over rounds whose output format was correct.

These formulas are used to evaluate model performance at the different stages and to keep results comparable across models.
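To make the three metrics concrete, here is a minimal Python sketch that computes them directly from the definitions above. It is not the benchmark's own code: the function names, the use of 2D integer grid coordinates for positions, and the choice to return 0.0 when a denominator is zero are all assumptions made for illustration.

```python
from typing import Tuple

# Assumption: positions are 2D integer grid coordinates (x, y).
Position = Tuple[int, int]


def l1_distance(a: Position, b: Position) -> int:
    """L1 (Manhattan) distance between two grid positions."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def forward_distance(p_start: Position, p_end: Position, p_target: Position) -> int:
    """F Dis: how much closer (in L1 distance) the tank ended up to the target."""
    return l1_distance(p_start, p_target) - l1_distance(p_end, p_target)


def format_accuracy(n_format: int, n_total: int) -> float:
    """F Acc: share of all rounds whose output format was correct."""
    # Assumption: report 0.0 when no rounds were played.
    return n_format / n_total if n_total else 0.0


def movement_accuracy(n_correct: int, n_format: int) -> float:
    """M Acc: share of correctly formatted rounds with a correct move direction."""
    # Assumption: report 0.0 when no round had a correct format.
    return n_correct / n_format if n_format else 0.0


if __name__ == "__main__":
    # Hypothetical episode: the tank starts at (0, 0), ends at (3, 1),
    # and the target sits at (5, 5).
    print(forward_distance((0, 0), (3, 1), (5, 5)))  # 10 - 6 = 4
    # Hypothetical counts: 20 rounds, 18 well formatted, 12 with correct moves.
    print(format_accuracy(18, 20))
    print(movement_accuracy(12, 18))
```

In the hypothetical episode, the tank starts 10 cells from the target and ends 6 cells away, so F Dis is 4. A tank that moved away from the target would get a negative value.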