Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning

Zimu Lu, Aojun Zhou, Ke Wang, Houxing Ren, Weikang Shi, Junting Pan, Mingjie Zhan, Hongsheng Li
2024-07-15
Abstract: Direct Preference Optimization (DPO) has proven effective at improving the performance of large language models (LLMs) on downstream tasks such as reasoning and alignment. In this work, we propose Step-Controlled DPO (SCDPO), a method for automatically providing stepwise error supervision by creating negative samples of mathematical reasoning rationales that start making errors at a specified step. By applying these samples in DPO training, SCDPO can better align the model to understand reasoning errors and output accurate reasoning steps. We apply SCDPO to both code-integrated and chain-of-thought solutions, empirically showing that it consistently improves the performance compared to naive DPO on three different SFT models, including one existing SFT model and two models we finetuned. Qualitative analysis of the credit assignment of SCDPO and DPO demonstrates the effectiveness of SCDPO at identifying errors in mathematical solutions. We then apply SCDPO to an InternLM2-20B model, resulting in a 20B model that achieves high scores of 88.5% on GSM8K and 58.1% on MATH, rivaling all other open-source LLMs, showing the great potential of our method.
Computation and Language
What problem does this paper attempt to address?
This paper aims to improve the performance of large language models (LLMs) on mathematical reasoning tasks. It proposes Step-Controlled Direct Preference Optimization (SCDPO), which automatically provides stepwise error supervision so that the model better understands reasoning errors and outputs more accurate reasoning steps.

### Background and Motivation

Standard Direct Preference Optimization (DPO) improves LLM outputs through relative (preference) feedback. For multi-step mathematical reasoning, however, preference pairs are usually judged only by the final answer. This signal is coarse: it does not indicate where in the reasoning an error occurred. Methods that add process supervision typically need large amounts of manually labeled data, which is costly and hard to scale.

### Solution

SCDPO addresses both issues by automatically generating training data with stepwise error information, without additional human annotation. The method has three stages:

1. **Initial Model Preparation**: Fine-tune a model on an existing dataset of problem-solution pairs to obtain an initial model with basic mathematical problem-solving ability.
2. **Step-Controlled Data Generation**:
   - From the solutions generated by the initial model, select the correct ones, that is, those whose final answers match the ground-truth answers.
   - For each correct solution, pick an intermediate step and regenerate the solution from that step onward with modified sampling hyperparameters (for example, a higher softmax temperature) until a solution with a wrong final answer is produced.
   - Each wrong solution is identical to the original correct solution up to the chosen step, so any error it contains occurs after that step.
3. **Step-Aware DPO Training**:
   - Pair each correct solution (preferred sample) with its wrong counterpart (dispreferred sample) for DPO training.
   - Train on a mixture of naive DPO data, judged by the final answer only, and the step-controlled SCDPO data, so that both the overall form of the solutions and the individual reasoning steps are optimized.

### Experimental Results

SCDPO consistently improves mathematical reasoning performance over naive DPO on three different SFT models: one existing SFT model and two fine-tuned by the authors. Applied to InternLM2-20B, it yields a model that scores 88.5% on GSM8K and 58.1% on MATH, rivaling other open-source LLMs.

### Conclusion

By combining step-controlled data generation with DPO training, SCDPO improves accuracy on mathematical reasoning tasks and provides a low-cost way to generate preference data with stepwise error supervision automatically.
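
The step-controlled data generation and pairwise training described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the sampler interface, the temperature value, and the retry budget are all hypothetical, and the loss shown is the standard DPO objective for a single pair rather than anything confirmed from the paper's training code.

```python
import math
from typing import Callable, List, Optional, Tuple

Steps = List[str]
# Hypothetical sampler: (problem, shared prefix steps, temperature) -> remaining steps.
Sampler = Callable[[str, Steps, float], Steps]


def build_step_controlled_pair(
    problem: str,
    correct_steps: Steps,
    gold_answer: str,
    sample_continuation: Sampler,
    extract_answer: Callable[[Steps], str],
    start_step: int,
    temperature: float = 1.1,  # assumed value, not taken from the paper
    max_tries: int = 8,        # assumed retry budget
) -> Optional[Tuple[Steps, Steps]]:
    """Return (chosen, rejected), or None if no wrong continuation was sampled."""
    prefix = correct_steps[:start_step]  # steps shared by both solutions
    for _ in range(max_tries):
        candidate = prefix + sample_continuation(problem, prefix, temperature)
        # Keep the candidate only if its final answer is wrong.
        if extract_answer(candidate) != gold_answer:
            return correct_steps, candidate
    return None


def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """Standard DPO loss for one pair, from summed sequence log-probabilities."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Toy usage with a stub sampler that always makes an arithmetic error.
def toy_sampler(problem: str, prefix: Steps, temperature: float) -> Steps:
    return ["2 + 3 = 6", "answer: 6"]


def toy_extract(steps: Steps) -> str:
    return steps[-1].split(": ")[1]


correct = ["add the two numbers", "2 + 3 = 5", "answer: 5"]
pair = build_step_controlled_pair(
    "What is 2 + 3?", correct, "5", toy_sampler, toy_extract, start_step=1
)
```

In this sketch the rejected solution shares its first `start_step` steps with the chosen one, so the preference signal is concentrated on the steps after the shared prefix. Pairs built this way could be concatenated with ordinary final-answer DPO pairs to form the mixed training set described above.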