Abstract: This paper explores optimal architectures for evaluating the outputs of large language models (LLMs) using LLMs themselves. We propose a novel framework that interprets LLMs as advocates within an ensemble of interacting agents, allowing them to defend their answers and reach conclusions through a judge and jury system. This approach offers a more dynamic and comprehensive evaluation process compared to traditional human-based assessments or automated metrics. We discuss the motivation behind this framework, its key components, and comparative advantages. We also present a probabilistic model to evaluate the error reduction achieved by iterative advocate systems. Finally, we outline experiments to validate the effectiveness of multi-advocate architectures and discuss future research directions.

What problem does this paper attempt to address?

This paper addresses the challenge of evaluating the outputs of large language models (LLMs). As LLMs develop rapidly, they are becoming more capable at generating human-like text, holding conversations, and performing complex language tasks. It is therefore increasingly important to evaluate these models accurately and to align their outputs with human preferences. Traditional evaluation methods, such as human evaluation and automated metrics, often fail to capture the nuance and complexity of LLM outputs, leaving a gap between model performance and user expectations.

Specifically, the paper attempts to solve the following problems:

1. **Limitations of traditional evaluation methods**:
   - **Human evaluation**: Time-consuming, expensive, and prone to inconsistency and bias.
   - **Automated metrics**: Often poorly aligned with human judgment, especially on open-ended generation tasks.
2. **The need for a more dynamic and comprehensive evaluation framework**:
   - Existing evaluation methods struggle to capture the subtle differences and complexity in LLM outputs, which makes their results inaccurate and unreliable.
3. **Exploring new evaluation architectures**:
   - The paper proposes a novel multi-agent framework that casts LLMs as advocates, judges, and juries in a court-inspired architecture and evaluates LLM outputs through structured debates, cross-examinations, and fair judgments.

### Solutions

To address these challenges, the paper proposes a framework based on an adversarial multi-agent system. The main contributions include:

1. **Dynamic multi-agent framework**: Using LLMs as interacting advocates, judges, and juries to provide more comprehensive, context-aware evaluations.
2. **Court-inspired architecture**: Using structured debates, cross-examinations, and fair judgments to reveal the strengths, weaknesses, and inconsistencies in LLM responses.
3. **Theoretical basis**: Drawing on bounded rationality, incentive design, persuasion and argumentation theory, and adversarial processes to ensure that the system promotes accurate, unbiased, and reliable evaluations.
4. **Voting theory and social choice principles**: Designing an effective jury system that aggregates the judgments of multiple LLM agents, promoting fair and representative evaluations while reducing the impact of strategic behavior and individual biases.

Through these contributions, the paper aims to develop a more efficient and reliable LLM evaluation method, and thereby to support more reliable, transparent, and responsible AI systems.
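The jury aggregation idea in the summary above can be illustrated with a small sketch. This is not the paper's probabilistic model, which the summary does not spell out. It is a plain majority vote plus the standard binomial (Condorcet-style) calculation, under the assumptions that the verdict is binary, that jurors vote independently, and that each juror is correct with the same probability. The function names `majority_verdict` and `majority_error` are hypothetical, chosen only for this illustration.

```python
from collections import Counter
from math import comb


def majority_verdict(votes):
    """Return the most common verdict, or None if there is a tie for first place."""
    counts = Counter(votes).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: no majority verdict
    return counts[0][0]


def majority_error(p_correct, n_jurors):
    """Probability that a simple majority of jurors is wrong.

    Assumes (for illustration only) a binary verdict, independent jurors,
    and an identical per-juror accuracy p_correct.
    """
    if n_jurors % 2 == 0:
        raise ValueError("use an odd number of jurors to avoid ties")
    needed = n_jurors // 2 + 1  # votes required for a correct majority
    p_majority_correct = sum(
        comb(n_jurors, k) * p_correct**k * (1 - p_correct) ** (n_jurors - k)
        for k in range(needed, n_jurors + 1)
    )
    return 1.0 - p_majority_correct


if __name__ == "__main__":
    print(majority_verdict(["A", "A", "B"]))
    for n in (1, 3, 5, 7):
        print(n, round(majority_error(0.7, n), 4))
```

Under these assumptions, the error of the majority shrinks as jurors are added whenever each juror is right more often than not. Real LLM jurors are unlikely to be fully independent, so the numbers should be read as an optimistic bound rather than a prediction.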

Adversarial Multi-Agent Evaluation of Large Language Models through Iterative Debates

Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate

MultiAgent Collaboration Attack: Investigating Adversarial Attacks in Large Language Model Collaborations via Debate

ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

Evaluating the Performance of Large Language Models via Debates

ACC-Debate: An Actor-Critic Approach to Multi-Agent Debate

Large Language Model Evaluation Via Multi AI Agents: Preliminary results

AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents

Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate

Cooperation, Competition, and Maliciousness: LLM-Stakeholders Interactive Negotiation

An Evaluation-Driven Approach to Designing LLM Agents: Process and Architecture

Multi-Agent Large Language Models for Conversational Task-Solving

Limits of Large Language Models in Debating Humans

MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration

Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks

clembench-2024: A Challenging, Dynamic, Complementary, Multilingual Benchmark and Underlying Flexible Framework for LLMs as Multi-Action Agents

Should we be going MAD? A Look at Multi-Agent Debate Strategies for LLMs

Can LLMs Beat Humans in Debating? A Dynamic Multi-agent Framework for Competitive Debate

Enhancing Multi-Agent Consensus through Third-Party LLM Integration: Analyzing Uncertainty and Mitigating Hallucinations in Large Language Models