Flow-DPO: Improving LLM Mathematical Reasoning through Online Multi-Agent Learning

Yihe Deng, Paul Mineiro

2024-10-30

Abstract: Mathematical reasoning is a crucial capability for Large Language Models (LLMs), yet generating detailed and accurate reasoning traces remains a significant challenge. This paper introduces a novel approach to produce high-quality reasoning traces for LLM fine-tuning using online learning Flows. Our method employs an incremental output production Flow, where component LLMs collaboratively construct solutions through iterative communication. We train the Flow using online Direct Preference Optimization (DPO) learning with rollouts, generating DPO pairs for each training example and updating models in real time. We directly compare the quality of reasoning traces generated by our method with those produced through direct model inference, demonstrating the effectiveness of our approach in improving LLM performance in mathematical reasoning tasks.

Computation and Language, Machine Learning

What problem does this paper attempt to address?

The paper addresses the limited proficiency of large language models (LLMs) in mathematical reasoning, in particular the difficulty of generating detailed, accurate, and clear reasoning steps. Many datasets pair mathematical problems with their answers, but high-quality reasoning traces that can be used for fine-tuning remain hard to obtain. Existing methods typically rely on direct model inference or human annotation, and these either produce reasoning steps that are too brief or disorganized, or are too costly.

The paper proposes online multi-agent learning as an alternative. Solutions are constructed by multiple LLMs that collaborate through iterative communication (a Flow), and the Flow is trained with online Direct Preference Optimization (DPO), with the models updated in real time during training. The aim is for the model to generate higher-quality reasoning traces automatically and thereby improve its own mathematical reasoning.
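To make the DPO component concrete, here is a minimal sketch of the standard DPO loss for a single preference pair. It is not taken from the paper: the function name `dpo_loss`, its parameter names, and the default `beta=0.1` are illustrative assumptions, and the inputs are assumed to be summed sequence log-probabilities of the preferred ("chosen") and dispreferred ("rejected") reasoning traces under the trained policy and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All arguments except beta are sequence log-probabilities.
    Names and the beta default are illustrative assumptions.
    """
    # How much more the policy favors each trace than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # loss = -log(sigmoid(z)) = log(1 + exp(-z)), computed stably.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Policy identical to the reference: the loss equals log(2).
baseline = dpo_loss(-2.0, -2.0, -2.0, -2.0)
# Policy shifted toward the chosen trace: the loss drops below log(2).
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

In an online setting such as the one the summary describes, a loss of this form would be evaluated on each newly generated pair and followed by a gradient step; how the pairs are produced from rollouts is specific to the paper and is not reproduced here.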

