Subtle Errors Matter: Preference Learning via Error-injected Self-editing

Kaishuai Xu, Tiezheng Yu, Wenjun Hou, Yi Cheng, Chak Tou Leong, Liangyou Li, Xin Jiang, Lifeng Shang, Qun Liu, Wenjie Li
2024-10-09
Abstract: Large Language Models (LLMs) have exhibited strong mathematical reasoning and computational prowess, tackling tasks ranging from basic arithmetic to advanced competition-level problems. However, frequently occurring subtle errors, such as miscalculations or incorrect substitutions, limit the models' full mathematical potential. Existing studies to improve mathematical ability typically involve distilling reasoning skills from stronger LLMs or applying preference learning to step-wise response pairs. Although these methods leverage samples of varying granularity to mitigate reasoning errors, they overlook the frequently occurring subtle errors. A major reason is that sampled preference pairs involve differences unrelated to the errors, which may distract the model from focusing on subtle errors. In this work, we propose a novel preference learning framework called eRror-Injected Self-Editing (RISE), which injects predefined subtle errors into partial tokens of correct solutions to construct hard pairs for error mitigation. In detail, RISE uses the model itself to edit a small number of tokens in the solution, injecting designed subtle errors. Then, pairs composed of self-edited solutions and their corresponding correct ones, along with pairs of correct and incorrect solutions obtained through sampling, are used together for subtle error-aware DPO training. Compared with other preference learning methods, RISE further refines the training objective to focus on predefined errors and their tokens, without requiring fine-grained sampling or preference annotation. Extensive experiments validate the effectiveness of RISE, with preference learning on Qwen2-7B-Instruct yielding notable improvements of 3.0% on GSM8K and 7.9% on MATH.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the subtle errors that large language models (LLMs) frequently make during mathematical reasoning. Although these models show strong mathematical reasoning and computational ability, handling tasks from basic arithmetic to advanced competition-level problems, subtle errors such as miscalculations, incorrect symbol substitutions, or omitted calculation terms keep them from realizing their full mathematical potential. Existing work improves mathematical ability mainly by distilling reasoning skills from stronger LLMs or by applying preference learning to step-wise response pairs, but these methods tend to overlook subtle errors: sampled preference pairs contain differences unrelated to the errors, which can distract the model from the errors themselves. To address this, the authors propose a preference learning framework called eRror-Injected Self-Editing (RISE). The framework constructs hard pairs for error mitigation by injecting predefined subtle errors into a small part of the tokens of correct solutions. Specifically, RISE uses the model itself to edit a small number of tokens in a solution, injecting the designed subtle errors. Pairs consisting of self-edited solutions and their corresponding correct solutions, together with correct and incorrect solution pairs obtained by sampling, are then used jointly for subtle error-aware direct preference optimization (DPO) training. Compared with other preference learning methods, RISE further refines the training objective to focus on the predefined errors and their tokens, without requiring fine-grained sampling or preference annotation. In extensive experiments, preference learning with RISE on Qwen2-7B-Instruct improves accuracy by 3.0% on GSM8K and 7.9% on MATH.
This indicates that the RISE framework helps improve the mathematical reasoning ability of LLMs, especially by reducing subtle errors.
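To make the idea of a hard pair concrete, the following is a minimal Python sketch, not the authors' implementation. It rests on two labeled assumptions: the hypothetical helper `inject_subtle_error` uses a simple rule (shifting the last number by one) as a stand-in for the model's own self-editing, and `dpo_loss` is the standard DPO objective rather than the refined, subtle error-aware objective the paper describes. The log-probability values in the usage lines are made up for illustration.

```python
import math
import re


def inject_subtle_error(solution: str) -> str:
    """Hypothetical rule-based stand-in for self-editing.

    RISE asks the model itself to edit a few tokens; here we simply add 1
    to the last number in the text to imitate a miscalculation.
    """
    matches = list(re.finditer(r"\d+", solution))
    if not matches:
        return solution  # nothing numeric to corrupt
    last = matches[-1]
    wrong = str(int(last.group()) + 1)
    return solution[: last.start()] + wrong + solution[last.end():]


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(margin)).

    margin = beta * [(policy - reference) log-prob of the chosen response
                     - (policy - reference) log-prob of the rejected one].
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Build one hard pair: the two solutions differ only in the injected error.
chosen = "3 * 4 = 12, so 12 + 5 = 17"
rejected = inject_subtle_error(chosen)

# Illustrative (made-up) sequence log-probabilities for the pair.
loss = dpo_loss(logp_chosen=-10.0, logp_rejected=-12.0,
                ref_logp_chosen=-11.0, ref_logp_rejected=-11.0)
```

Because the chosen and rejected solutions share every token except the edited one, the preference signal in such a pair comes from the error alone, which is the motivation the summary gives for mixing these pairs with ordinary sampled pairs.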