Flow-DPO: Improving LLM Mathematical Reasoning through Online Multi-Agent Learning

Yihe Deng, Paul Mineiro
2024-10-30
Abstract: Mathematical reasoning is a crucial capability for Large Language Models (LLMs), yet generating detailed and accurate reasoning traces remains a significant challenge. This paper introduces a novel approach to produce high-quality reasoning traces for LLM fine-tuning using online learning Flows. Our method employs an incremental output production Flow, where component LLMs collaboratively construct solutions through iterative communication. We train the Flow using online Direct Preference Optimization (DPO) learning with rollouts, generating DPO pairs for each training example and updating models in real-time. We directly compare the quality of reasoning traces generated by our method with those produced through direct model inference, demonstrating the effectiveness of our approach in improving LLM performance in mathematical reasoning tasks.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
The paper addresses a weakness of large language models (LLMs) in mathematical reasoning: generating reasoning steps that are detailed, accurate, and clear. Many datasets contain mathematical problems and their answers, but producing high-quality reasoning traces for those problems remains difficult. Existing methods typically rely on direct model inference or human annotation, and these either produce reasoning steps that are too brief or disorganized, or are too costly. To overcome this, the paper proposes online multi-agent learning: multiple LLMs are arranged in a Flow in which they construct a solution through collaboration and iterative communication, and the Flow is trained with online Direct Preference Optimization (DPO), so the models are updated in real time during training. The goal is to generate higher-quality reasoning traces automatically, use them to fine-tune LLMs, and thereby improve the models' mathematical reasoning through self-improvement.
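To make the training signal concrete, the following is a minimal Python sketch, not the authors' implementation. It shows the standard DPO loss for a single preference pair, together with one hypothetical way a pair could be selected from rollouts: two candidate next chunks are each completed several times, and the candidate whose completions reach the correct answer more often is labeled as preferred. The helper names (`dpo_loss`, `rollout_success_rate`, `make_dpo_pair`), the `complete` and `is_correct` callables, the value of `beta`, and the pair-selection rule are all assumptions made for illustration; the summary above does not specify them.

```python
import math
from typing import Callable, Optional, Tuple


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are summed log-probabilities of each response under the policy
    being trained and under a frozen reference model. beta=0.1 is an
    illustrative default, not a value taken from the paper.
    """
    # How much more the policy prefers "chosen" over "rejected" than the
    # reference model does.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) written as a numerically stable softplus(-z).
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


def rollout_success_rate(prefix: str, chunk: str,
                         complete: Callable[[str], str],
                         is_correct: Callable[[str], bool],
                         n_rollouts: int = 4) -> float:
    """Fraction of rollouts from prefix + chunk that end in a correct answer.

    `complete` stands in for sampling the rest of the solution from the
    model(s); `is_correct` stands in for checking against the known answer.
    Both are hypothetical placeholders.
    """
    wins = 0
    for _ in range(n_rollouts):
        final_solution = complete(prefix + chunk)
        wins += int(is_correct(final_solution))
    return wins / n_rollouts


def make_dpo_pair(prefix: str, chunk_a: str, chunk_b: str,
                  complete: Callable[[str], str],
                  is_correct: Callable[[str], bool],
                  n_rollouts: int = 4) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected), or None when rollouts do not separate them."""
    rate_a = rollout_success_rate(prefix, chunk_a, complete, is_correct, n_rollouts)
    rate_b = rollout_success_rate(prefix, chunk_b, complete, is_correct, n_rollouts)
    if rate_a == rate_b:
        return None  # no preference signal from this comparison
    return (chunk_a, chunk_b) if rate_a > rate_b else (chunk_b, chunk_a)


if __name__ == "__main__":
    # Toy stand-ins: the "model" returns its input unchanged, and a solution
    # counts as correct if it ends with the known answer "4".
    pair = make_dpo_pair("2 + 2 = ", "4", "5",
                         complete=lambda text: text,
                         is_correct=lambda s: s.endswith("4"))
    print(pair)
    print(dpo_loss(-1.0, -5.0, -2.0, -2.0))
```

In an online setting, each pair produced this way would be used for a gradient step immediately, so later rollouts come from the updated models. When the policy and the reference assign identical log-probabilities, the loss equals log 2, and it falls as the policy shifts probability toward the chosen response relative to the reference.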