Abstract: The reasoning of Large Language Models (LLMs) can be improved with test-time aggregation strategies, i.e., generating multiple samples and voting among them. While these strategies improve performance, they often reach a saturation point. Refinement offers an alternative by using LLM-generated feedback to improve solution quality. However, refinement introduces three key challenges: (1) Excessive refinement: Uniformly refining all instances can over-correct and reduce overall performance. (2) Inability to localize and address errors: LLMs have a limited ability to self-correct and struggle to identify and fix their own mistakes. (3) Insufficient refinement: Deciding how many iterations of refinement are needed is non-trivial, and stopping too soon can leave errors unaddressed. To tackle these issues, we propose MAgICoRe, which avoids excessive refinement by categorizing problem difficulty as easy or hard, solving easy problems with coarse-grained aggregation and hard ones with fine-grained, iterative multi-agent refinement. To improve error localization, we incorporate external step-wise reward model (RM) scores. Moreover, to ensure effective refinement, we employ a multi-agent loop with three agents: the Solver, the Reviewer (which generates targeted feedback based on step-wise RM scores), and the Refiner (which incorporates that feedback). To ensure sufficient refinement, we re-evaluate updated solutions and iteratively initiate further rounds of refinement. We evaluate MAgICoRe on Llama-3-8B and GPT-3.5 and show its effectiveness across 5 math datasets. Even one iteration of MAgICoRe beats Self-Consistency by 3.4%, Best-of-k by 3.2%, and Self-Refine by 4.0% while using less than half the samples. Unlike iterative refinement with baselines, MAgICoRe continues to improve with more iterations. Finally, our ablations highlight the importance of MAgICoRe's RMs and multi-agent communication.
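The coarse-to-fine control flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say how difficulty is decided, so the sketch assumes a simple rule: a problem counts as easy when the sampled answers agree above a threshold. The names `Solution`, `agreement`, `solve`, `step_scorer`, and `threshold`, and the choice to refine every sample in each round, are all hypothetical. The Solver, step-wise RM, Reviewer, and Refiner are passed in as plain callables.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Solution:
    steps: List[str]  # reasoning steps, one string per step
    answer: str       # final answer extracted from the reasoning


def agreement(solutions: List[Solution]) -> float:
    """Fraction of samples that share the most common answer."""
    counts = Counter(s.answer for s in solutions)
    return counts.most_common(1)[0][1] / len(solutions)


def majority_answer(solutions: List[Solution]) -> str:
    """Coarse-grained aggregation: plain majority vote over final answers."""
    return Counter(s.answer for s in solutions).most_common(1)[0][0]


def solve(
    problem: str,
    solver: Callable[[str], Solution],
    step_scorer: Callable[[str, Solution], List[float]],
    reviewer: Callable[[str, Solution, List[float]], str],
    refiner: Callable[[str, Solution, str], Solution],
    k: int = 4,
    max_iters: int = 3,
    threshold: float = 0.75,  # assumed difficulty cutoff, not from the paper
) -> str:
    # Coarse stage: sample k solutions from the Solver.
    solutions = [solver(problem) for _ in range(k)]
    for _ in range(max_iters):
        # Re-evaluate before every round; stop once the samples agree enough.
        if agreement(solutions) >= threshold:
            break
        # Fine stage: score steps, generate feedback, refine each sample.
        refined = []
        for sol in solutions:
            scores = step_scorer(problem, sol)
            feedback = reviewer(problem, sol, scores)
            refined.append(refiner(problem, sol, feedback))
        solutions = refined
    return majority_answer(solutions)
```

Because the agreement check runs before each round, easy problems never reach the Reviewer or Refiner, and hard problems stop being refined as soon as the updated solutions agree or the iteration budget runs out.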

ReConcile: Round-Table Conference Improves Reasoning via Consensus among Diverse LLMs

Concise and Organized Perception Facilitates Large Language Models for Deductive Reasoning

Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate

Enhancing LLM Reasoning with Multi-Path Collaborative Reactive and Reflection agents

Towards Reasoning in Large Language Models via Multi-Agent Peer Review Collaboration

MAgICoRe: Multi-Agent, Iterative, Coarse-to-Fine Refinement for Reasoning

MALT: Improving Reasoning with Multi-Agent LLM Training

Examining Inter-Consistency of Large Language Models Collaboration: An In-depth Analysis via Debate

Diversity of Thought Elicits Stronger Reasoning Capabilities in Multi-Agent Debate Frameworks

Corex: Pushing the Boundaries of Complex Reasoning through Multi-Model Collaboration

Confidence Calibration and Rationalization for LLMs via Multi-Agent Deliberation

ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent

LLM Harmony: Multi-Agent Communication for Problem Solving

Refining LLMs Outputs with Iterative Consensus Ensemble (ICE)

Enhancing Language Model Reasoning via Weighted Reasoning in Self-Consistency

Enhancing Multi-Agent Consensus through Third-Party LLM Integration: Analyzing Uncertainty and Mitigating Hallucinations in Large Language Models

Reasoning in Conversation: Solving Subjective Tasks through Dialogue Simulation for Large Language Models

Cooperative Strategic Planning Enhances Reasoning Capabilities in Large Language Models

ReAct: Synergizing Reasoning and Acting in Language Models

MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration

SocraSynth: Multi-LLM Reasoning with Conditional Statistics