Abstract: Direct Preference Optimization (DPO) has proven effective at improving the performance of large language models (LLMs) on downstream tasks such as reasoning and alignment. In this work, we propose Step-Controlled DPO (SCDPO), a method for automatically providing stepwise error supervision by creating negative samples of mathematical reasoning rationales that start making errors at a specified step. By applying these samples in DPO training, SCDPO can better align the model to understand reasoning errors and output accurate reasoning steps. We apply SCDPO to both code-integrated and chain-of-thought solutions, empirically showing that it consistently improves the performance compared to naive DPO on three different SFT models, including one existing SFT model and two models we finetuned. Qualitative analysis of the credit assignment of SCDPO and DPO demonstrates the effectiveness of SCDPO at identifying errors in mathematical solutions. We then apply SCDPO to an InternLM2-20B model, resulting in a 20B model that achieves high scores of 88.5% on GSM8K and 58.1% on MATH, rivaling all other open-source LLMs, showing the great potential of our method.

What problem does this paper attempt to address?

This paper aims to improve the performance of large language models (LLMs) on mathematical reasoning tasks. It proposes Step-Controlled Direct Preference Optimization (SCDPO), which automatically provides stepwise error supervision so that the model learns to recognize reasoning errors and to output more accurate reasoning steps.

### Background and Motivation

Standard Direct Preference Optimization (DPO) improves the quality of LLM outputs through relative (pairwise) feedback. For multi-step mathematical reasoning, however, judging a solution only by its final answer is a coarse signal: it does not indicate where in the reasoning an error occurred. Methods that add process supervision usually require a large amount of manually labeled data, which is costly and hard to scale.

### Solution

To address these challenges, the paper proposes SCDPO. Its core idea is to generate training data that carries stepwise error information automatically, without any additional manual labels. The procedure is as follows:

1. **Initial Model Preparation**: Fine-tune the model on an existing dataset of problem-solution pairs to obtain an initial model with basic mathematical problem-solving ability.
2. **Step-Controlled Data Generation**:
   - From the solutions generated by the initial model, select the correct ones, i.e. those whose final answers match the ground-truth answers.
   - For each correct solution, pick an intermediate step and regenerate the solution from that step onward with adjusted sampling hyperparameters (for example, a higher softmax temperature) until an incorrect solution is produced.
   - Each incorrect solution is identical to the original correct solution up to the chosen step and may contain errors only after it.
3. **Step-Aware DPO Training**:
   - Pair each correct solution (preferred sample) with its incorrect counterpart (dispreferred sample) for DPO training.
   - During training, mix the naive DPO data, which is judged only by the final answer, with the step-controlled SCDPO data, so that both the overall form of the solutions and the detailed reasoning steps are optimized.

### Experimental Results

SCDPO consistently improves mathematical reasoning performance over naive DPO on three different SFT models, one existing and two fine-tuned by the authors. Applied to InternLM2-20B, it yields a model that scores 88.5% on GSM8K and 58.1% on MATH, rivaling other open-source LLMs.

### Conclusion

By introducing step-controlled data generation and training, SCDPO improves accuracy on mathematical reasoning tasks and provides a low-cost way to generate preference data with stepwise error supervision automatically.
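The step-controlled data generation and the pairwise DPO objective summarized above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `make_scdpo_pair`, `sample_continuation`, `extract_answer`, the retry-with-higher-temperature schedule, and all default values are hypothetical names and choices introduced here. Only the per-pair loss, `-log sigmoid(beta * margin)`, is the standard DPO formulation.

```python
import math
import random
from typing import Callable, Dict, List, Optional

# Hypothetical sampler interface (an assumption, not from the paper):
# (problem, prefix_steps, temperature) -> list of continuation steps.
Sampler = Callable[[str, List[str], float], List[str]]


def make_scdpo_pair(
    problem: str,
    correct_steps: List[str],
    gold_answer: str,
    sample_continuation: Sampler,
    extract_answer: Callable[[List[str]], str],
    start_temperature: float = 1.0,   # assumed default
    temperature_step: float = 0.1,    # assumed schedule
    max_tries: int = 5,
    rng: Optional[random.Random] = None,
) -> Optional[Dict[str, object]]:
    """Build one (chosen, rejected) pair whose solutions diverge after step k."""
    if len(correct_steps) < 2:
        return None  # no intermediate step to branch from
    rng = rng or random.Random(0)
    # Keep the first k steps of the correct solution as a shared prefix.
    k = rng.randrange(1, len(correct_steps))
    prefix = correct_steps[:k]
    temperature = start_temperature
    for _ in range(max_tries):
        candidate = prefix + sample_continuation(problem, prefix, temperature)
        # A continuation counts as a negative only if its final answer is wrong.
        if extract_answer(candidate) != gold_answer:
            return {
                "prompt": problem,
                "chosen": correct_steps,
                "rejected": candidate,
                "error_from_step": k,
            }
        temperature += temperature_step  # sample more aggressively on retry
    return None


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen ratio - rejected ratio)))."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # Numerically stable softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

In this sketch, the pairs returned by `make_scdpo_pair` would simply be concatenated with ordinary answer-judged DPO pairs and shuffled before training, mirroring the mixed-data setup described in the summary. The mixing ratio is left open because the summary does not state one.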

Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning

Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs

Step-level Value Preference Optimization for Mathematical Reasoning

Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Better Solvers for Math Word Problems

TPO: Aligning Large Language Models with Multi-branch & Multi-step Preference Trees

Enhancing Mathematical Reasoning in LLMs by Stepwise Correction

BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning

Improving Multi-Step Reasoning Abilities of Large Language Models with Direct Advantage Policy Optimization

DialCoT Meets PPO: Decomposing and Exploring Reasoning Paths in Smaller Language Models

Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning

Key-Point-Driven Mathematical Reasoning Distillation of Large Language Model

DOP: Diagnostic-Oriented Prompting for Large Language Models in Mathematical Correction

Enhancing Multi-Step Reasoning Abilities of Language Models through Direct Q-Function Optimization

Preference Optimization for Reasoning with Pseudo Feedback

Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability

Step Guided Reasoning: Improving Mathematical Reasoning using Guidance Generation and Step Reasoning

Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive

Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning

Iterative Reasoning Preference Optimization

Flow-DPO: Improving LLM Mathematical Reasoning through Online Multi-Agent Learning