Abstract: Large Language Models (LLMs) have exhibited strong mathematical reasoning and computational prowess, tackling tasks ranging from basic arithmetic to advanced competition-level problems. However, frequently occurring subtle errors, such as miscalculations or incorrect substitutions, limit the models' full mathematical potential. Existing studies to improve mathematical ability typically involve distilling reasoning skills from stronger LLMs or applying preference learning to step-wise response pairs. Although these methods leverage samples of varying granularity to mitigate reasoning errors, they overlook the frequently occurring subtle errors. A major reason is that sampled preference pairs involve differences unrelated to the errors, which may distract the model from focusing on subtle errors. In this work, we propose a novel preference learning framework called eRror-Injected Self-Editing (RISE), which injects predefined subtle errors into partial tokens of correct solutions to construct hard pairs for error mitigation. In detail, RISE uses the model itself to edit a small number of tokens in the solution, injecting designed subtle errors. Then, pairs composed of self-edited solutions and their corresponding correct ones, along with pairs of correct and incorrect solutions obtained through sampling, are used together for subtle error-aware DPO training. Compared with other preference learning methods, RISE further refines the training objective to focus on predefined errors and their tokens, without requiring fine-grained sampling or preference annotation. Extensive experiments validate the effectiveness of RISE, with preference learning on Qwen2-7B-Instruct yielding notable improvements of 3.0% on GSM8K and 7.9% on MATH.

What problem does this paper attempt to address?

This paper addresses the subtle errors that large language models (LLMs) frequently make in mathematical reasoning. Although these models show strong mathematical reasoning and computational ability on tasks ranging from basic arithmetic to competition-level problems, subtle errors such as miscalculations, incorrect substitutions, or omitted terms keep them from reaching their full mathematical potential. Existing work improves mathematical ability mainly by distilling reasoning skills from stronger LLMs or by applying preference learning to step-wise response pairs. These methods tend to overlook subtle errors because sampled preference pairs contain differences unrelated to the errors, which can distract the model from the errors themselves. To address this, the authors propose a preference learning framework called eRror-Injected Self-Editing (RISE). The framework constructs hard pairs for error mitigation by injecting predefined subtle errors into a few tokens of correct solutions. Specifically, RISE uses the model itself to edit a small number of tokens in a solution, injecting the designed subtle errors. Pairs of self-edited solutions and their corresponding correct solutions, together with pairs of correct and incorrect solutions obtained by sampling, are then used jointly for subtle error-aware direct preference optimization (DPO) training. Compared with other preference learning methods, RISE further refines the training objective to focus on the predefined errors and their tokens, without requiring fine-grained sampling or preference annotation. In the reported experiments, preference learning with RISE on Qwen2-7B-Instruct yields improvements of 3.0% on GSM8K and 7.9% on MATH.
This indicates that the RISE framework helps to improve the mathematical reasoning ability of LLMs, especially in reducing minor errors.
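
The pair construction described above can be illustrated with a minimal Python sketch. It makes several assumptions that do not come from the paper: `inject_numeric_error` is a hypothetical rule-based stand-in for the model's self-editing step (RISE prompts the model itself to make the edit), the dictionary keys and `source` labels are illustrative, and `dpo_loss` is the standard DPO loss for a single pair, not the refined subtle error-aware objective the paper trains with.

```python
import math
import re


def inject_numeric_error(solution: str) -> str:
    """Hypothetical stand-in for self-editing: add 1 to the last number.

    RISE asks the model itself to edit a few tokens; this rule-based edit
    only mimics the effect (a small, localized miscalculation).
    """
    matches = list(re.finditer(r"\d+", solution))
    if not matches:
        return solution  # nothing to edit
    last = matches[-1]
    wrong = str(int(last.group()) + 1)
    return solution[: last.start()] + wrong + solution[last.end():]


def build_pairs(prompt: str, correct: str, sampled_incorrect: list[str]) -> list[dict]:
    """Combine one self-edited hard pair with ordinary sampled pairs."""
    pairs = []
    edited = inject_numeric_error(correct)
    if edited != correct:  # keep the hard pair only if an edit was made
        pairs.append({"prompt": prompt, "chosen": correct,
                      "rejected": edited, "source": "self_edit"})
    for bad in sampled_incorrect:
        pairs.append({"prompt": prompt, "chosen": correct,
                      "rejected": bad, "source": "sampled"})
    return pairs


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair, given sequence log-probabilities.

    Computes -log(sigmoid(margin)) as log(1 + exp(-margin)).
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    return math.log1p(math.exp(-margin))


if __name__ == "__main__":
    for pair in build_pairs("What is 3 + 4?", "3 + 4 = 7", ["3 + 4 = 12"]):
        print(pair["source"], "|", pair["chosen"], "|", pair["rejected"])
```

In the self-edited pair the chosen and rejected solutions differ in only one token, so the preference signal concentrates on the injected error. The sampled pair, by contrast, may differ in ways unrelated to any particular error.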

Subtle Errors Matter: Preference Learning via Error-injected Self-editing

Learning to Reason via Self-Iterative Process Feedback for Small Language Models

Recursive Introspection: Teaching Language Model Agents How to Self-Improve

Evaluating Mathematical Reasoning of Large Language Models: A Focus on Error Identification and Correction

Learning From Mistakes Makes LLM Better Reasoner

Masked Thought: Simply Masking Partial Reasoning Steps Can Improve Mathematical Reasoning Learning of Language Models

Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint

Physics of Language Models: Part 2.2, How to Learn From Mistakes on Grade-School Math Problems

LLMs can Find Mathematical Reasoning Mistakes by Pedagogical Chain-of-Thought

SuperCorrect: Supervising and Correcting Language Models with Error-Driven Insights

Building Math Agents with Multi-Turn Iterative Preference Learning

Course-Correction: Safety Alignment Using Synthetic Preferences

Bridging the Novice-Expert Gap via Models of Decision-Making: A Case Study on Remediating Math Mistakes

Preference Optimization for Reasoning with Pseudo Feedback

LLMs-as-Instructors: Learning from Errors Toward Automating Model Improvement

Can LLMs Learn from Previous Mistakes? Investigating LLMs' Errors to Boost for Reasoning

Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning

Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs

Enhancing Multi-hop Reasoning through Knowledge Erasure in Large Language Model Editing

Self-Correction is More than Refinement: A Learning Framework for Visual and Language Reasoning Tasks