Abstract: Large Language Models (LLMs) are becoming increasingly powerful and capable of handling complex tasks, e.g., building single agents and multi-agent systems. Compared to single agents, multi-agent systems place higher demands on the collaboration capabilities of language models. Many benchmarks have been proposed to evaluate these collaborative abilities. However, these benchmarks lack fine-grained evaluations of LLM collaborative capabilities. Additionally, existing work ignores scenarios that combine multi-agent collaboration and competition. To address these two problems, we propose a benchmark, called BattleAgentBench, which defines seven sub-stages across three difficulty levels and conducts a fine-grained evaluation of language models in terms of single-agent scenario navigation capabilities, paired-agent task execution abilities, and multi-agent collaboration and competition capabilities. We conducted extensive evaluations on four leading closed-source and seven open-source models. Experimental results indicate that API-based models perform excellently on simple tasks, whereas small open-source models struggle even with simple tasks. On difficult tasks that require collaborative and competitive abilities, API-based models have demonstrated some collaborative capabilities, but there is still enormous room for improvement.

What problem does this paper attempt to address?

The main problem this paper addresses is that existing benchmarks lack fine-grained evaluation criteria for assessing the collaboration and competition capabilities of large language models (LLMs). Specifically:

1. **Limitations of Existing Benchmarks**:
   - Existing benchmarks evaluate the collaboration capabilities of language models in multi-agent systems only coarsely and lack fine-grained evaluation.
   - Scenarios that combine collaboration and competition in multi-agent systems are ignored in existing work.
2. **The Proposed New Benchmark**:
   - To solve these problems, the authors propose a new benchmark, **BattleAgentBench**. It examines the navigation, collaboration, and competition capabilities of language models in single-agent, two-agent, and multi-agent environments through seven sub-stages across three difficulty levels.
3. **Evaluation Framework**:
   - **Level 1**: Basic agent capability evaluation, mainly testing the basic capabilities of a single agent and its navigation ability in simple scenarios.
   - **Level 2**: Two-agent interaction evaluation, testing the collaboration and competition capabilities between two agents.
   - **Level 3**: Multi-agent dynamic evaluation, testing the collaboration and competition capabilities of multiple agents in complex scenarios.
4. **Experimental Results**:
   - Across four closed-source and seven open-source models, the results show that API-based models perform well on simple tasks but still have a great deal of room for improvement on complex tasks that require collaboration and competition capabilities. Small open-source models perform poorly on simple tasks and even worse on complex tasks.

### Formula Summary

- **Forward Distance (F Dis)**:
  \[ F\,Dis = L_1(p_s - p_{target}) - L_1(p_e - p_{target}) \]
  where \(L_1\) represents the L1 distance, \(p_s\) is the initial position of the tank, \(p_{target}\) is the target position, and \(p_e\) is the position of the tank at the end of the game.
- **Format Accuracy (F Acc)**:
  \[ F\,Acc = \frac{N_{format}}{N_{total}} \]
  where \(N_{format}\) represents the number of rounds with correct output format, and \(N_{total}\) represents the total number of rounds.
- **Movement Accuracy (M Acc)**:
  \[ M\,Acc = \frac{N_{correct}}{N_{format}} \]
  where \(N_{correct}\) represents the number of rounds with correct movement direction.

These formulas are used to evaluate the performance of models at different stages, ensuring the accuracy and comparability of the evaluation.
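
The three metrics can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `(x, y)` tuple representation of positions, the choice to return 0.0 when a denominator is zero, and the example values are all hypothetical.

```python
from typing import Tuple

# Assumption: positions are integer (x, y) grid coordinates.
Position = Tuple[int, int]


def l1_distance(a: Position, b: Position) -> int:
    """L1 (Manhattan) distance between two positions."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def forward_distance(p_s: Position, p_e: Position, p_target: Position) -> int:
    """F Dis: how much closer the tank is to the target at the end than at the start."""
    return l1_distance(p_s, p_target) - l1_distance(p_e, p_target)


def format_accuracy(n_format: int, n_total: int) -> float:
    """F Acc: share of rounds whose output format is correct."""
    # Assumption: report 0.0 when no rounds were played.
    return n_format / n_total if n_total else 0.0


def movement_accuracy(n_correct: int, n_format: int) -> float:
    """M Acc: share of correctly formatted rounds with a correct movement direction."""
    # Assumption: report 0.0 when no round had a valid format.
    return n_correct / n_format if n_format else 0.0


# Hypothetical episode: start at (0, 0), end at (3, 4), target at (5, 5).
print(forward_distance((0, 0), (3, 4), (5, 5)))  # 7
print(format_accuracy(18, 20))
print(movement_accuracy(12, 18))
```

A positive F Dis means the tank ended closer to the target than it started, and a negative value means it moved away. Note that M Acc is normalized by the number of correctly formatted rounds rather than by all rounds, so it measures movement quality only where the output could be parsed.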

BattleAgentBench: A Benchmark for Evaluating Cooperation and Competition Capabilities of Language Models in Multi-Agent Systems
