Can LLMs Beat Humans in Debating? A Dynamic Multi-agent Framework for Competitive Debate

Yiqun Zhang, Xiaocui Yang, Shi Feng, Daling Wang, Yifei Zhang, Kaisong Song

2024-08-20

Abstract: Competitive debate is a complex task of computational argumentation. Large Language Models (LLMs) suffer from hallucinations and lack competitiveness in this field. To address these challenges, we introduce Agent for Debate (Agent4Debate), a dynamic multi-agent framework based on LLMs designed to enhance their capabilities in competitive debate. Drawing inspiration from human behavior in debate preparation and execution, Agent4Debate employs a collaborative architecture in which four specialized agents (Searcher, Analyzer, Writer, and Reviewer) dynamically interact and cooperate. These agents work throughout the debate process, covering multiple stages from initial research and argument formulation to rebuttal and summary. To comprehensively evaluate framework performance, we construct the Competitive Debate Arena, comprising 66 carefully selected Chinese debate motions. We recruit ten experienced human debaters and collect records of 200 debates involving Agent4Debate, baseline models, and humans. The evaluation uses both the Debatrix automatic scoring system and professional human reviewers, whose judgments yield the Debatrix-Elo and Human-Elo rankings, respectively. Experimental results indicate that the state-of-the-art Agent4Debate exhibits capabilities comparable to those of humans. Furthermore, ablation studies demonstrate the effectiveness of each component in the agent structure.

Computation and Language

What problem does this paper attempt to address?

The paper addresses two weaknesses of large language models (LLMs) in competitive debate: hallucination and a lack of competitiveness. It proposes Agent4Debate, an LLM-based multi-agent framework that mimics how human debate teams collaborate. The framework combines four specialized roles (Searcher, Analyzer, Writer, and Reviewer) that dynamically interact and cooperate throughout the debate process. To evaluate the framework, the researchers constructed the Competitive Debate Arena, which comprises 66 carefully selected Chinese debate motions. They recruited ten experienced human debaters and collected records of 200 debates involving Agent4Debate, baseline models, and humans. Debates were scored by the Debatrix automatic scoring system and by professional human reviewers, and the results were aggregated into Debatrix-Elo and Human-Elo rankings. Experimental results indicate that the state-of-the-art Agent4Debate exhibits capabilities comparable to those of humans. Ablation studies further show that each component of the agent structure contributes to performance.
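The summary does not say how the Debatrix-Elo and Human-Elo rankings are computed. As a rough illustration of how pairwise debate outcomes can be turned into a ranking, here is a minimal sketch of the standard Elo update. The K-factor of 32, the starting rating of 1000, and the function names are assumptions made for this example; they are not taken from the paper.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B (a value in (0, 1))."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """Return updated (A, B) ratings after one debate.

    score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a draw.
    k = 32 is an assumed K-factor, not a value from the paper.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Hypothetical example: two debaters start at 1000 and A wins.
a, b = update_elo(1000.0, 1000.0, score_a=1.0)
print(a, b)  # 1016.0 984.0
```

Under this scheme the same debate records can produce two separate rankings simply by taking the winner from a different judge, for example an automatic scorer for one ranking and human reviewers for the other.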

LLMs as Debate Partners: Utilizing Genetic Algorithms and Adversarial Search for Adaptive Arguments

ACC-Debate: An Actor-Critic Approach to Multi-Agent Debate

Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate

Debatrix: Multi-dimensional Debate Judge with Iterative Chronological Analysis Based on LLM

Limits of Large Language Models in Debating Humans

GroupDebate: Enhancing the Efficiency of Multi-Agent Debate Using Group Discussion

Adversarial Multi-Agent Evaluation of Large Language Models through Iterative Debates

Systematic Biases in LLM Simulations of Debates

Should we be going MAD? A Look at Multi-Agent Debate Strategies for LLMs

On scalable oversight with weak LLMs judging strong LLMs

AgentBench: Evaluating LLMs as Agents

Robot Debater: Debate-styled Text Auto-generation System Based on Large Foundation Language Models

Diversity of Thought Elicits Stronger Reasoning Capabilities in Multi-Agent Debate Frameworks

MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration

Debating with More Persuasive LLMs Leads to More Truthful Answers

Improving Multi-Agent Debate with Sparse Communication Topology

ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

Evaluating the Performance of Large Language Models via Debates