Abstract: In reasoning tasks, even a minor error can cascade into inaccurate results, leading to suboptimal performance of large language models in such domains. Earlier fine-tuning approaches sought to mitigate this by leveraging more precise supervisory signals from human labeling, larger models, or self-sampling, although at a high cost. In contrast, we develop a method that avoids external resources and relies instead on introducing perturbations to the input. Our training approach randomly masks certain tokens within the chain of thought, a technique we found to be particularly effective for reasoning tasks. When applied to fine-tuning Llama-2-7B on GSM8K, this method achieved a 5% improvement in GSM8K accuracy and a 10% improvement in GSM-IC accuracy over standard supervised fine-tuning, with only a few lines of code changed. Furthermore, it is complementary to existing methods. When integrated with related explicit data augmentation methods, it leads to improvements across five datasets built with different augmentation methods, as well as on two different base models. We further investigate the mechanisms behind this improvement through case studies and quantitative analysis, which suggest that our approach may better support the model in capturing long-distance dependencies, especially those related to the question. This enhancement could deepen the model's understanding of the premises in questions and of prior steps. Our code is available at GitHub.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper primarily aims to improve the performance of large language models (LLMs) on multi-step reasoning tasks. Specifically:

1. **Error Cascade Problem**: Even a small error can propagate through the entire solution process and make the final result wrong.
2. **Hallucination Problem**: State-of-the-art models are prone to hallucinations during reasoning, which can lead to errors.
3. **Cost of Supervision Signals**: Previous methods often rely on human annotations, larger models, or self-sampling to obtain more accurate supervision signals, but these methods are costly.

### Solution

The paper proposes a simple and effective method, **Masked Thought Fine-Tuning (MFT)**. It introduces noise into the input by randomly masking certain tokens in the reasoning steps during fine-tuning. According to the abstract, fine-tuning Llama-2-7B on GSM8K with this method improved GSM8K accuracy by 5% and GSM-IC accuracy by 10% over standard supervised fine-tuning. The method has the following characteristics:

1. **Simplicity**: Easy to implement, requiring only the replacement of specific tokens in the reasoning chain.
2. **Effectiveness**: Achieves performance improvements across multiple datasets.
3. **Complementarity**: Complementary to existing data augmentation techniques, further enhancing model performance when combined with them.

### Main Contributions

1. **Proposing the MFT Method**: Improves the reasoning ability of language models by randomly masking certain tokens in the reasoning steps.
2. **Analyzing the Method's Effectiveness**: Analyzes MFT from a regularization perspective and proposes two guiding principles.
3. **Enhancing Dependency**: Quantitative analysis and case studies indicate that MFT strengthens the model's dependency on the original math question and on early steps, thereby reducing the risk of misunderstanding and of inconsistent reasoning.

### Experimental Results

The paper validates MFT on a range of datasets and models, demonstrating its generalization ability and sample efficiency across different tasks. The gains are particularly pronounced on smaller datasets. Additionally, compared with other regularization techniques, MFT is a more effective way of introducing noise.
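The masking step described under "Solution" can be illustrated with a minimal sketch. This is not the authors' implementation. The function name `mask_reasoning_tokens`, the `[mask]` placeholder, the default `mask_prob`, and the choice to keep the original tokens as training targets while excluding the question from the loss are all assumptions made for illustration. The summary only states that tokens in the reasoning chain are randomly replaced.

```python
import random

# Hypothetical placeholder; the actual mask token is not given in this summary.
MASK_TOKEN = "[mask]"


def mask_reasoning_tokens(question_tokens, reasoning_tokens, mask_prob=0.2, rng=None):
    """Build (inputs, labels) for one fine-tuning example.

    Only tokens in the reasoning chain are candidates for masking; the
    question is left intact. Labels keep the original reasoning tokens
    (an assumption), and question positions get None, meaning "no loss".
    """
    if rng is None:
        rng = random.Random()

    # Replace each reasoning token with the mask token with probability mask_prob.
    masked_reasoning = [
        MASK_TOKEN if rng.random() < mask_prob else tok
        for tok in reasoning_tokens
    ]

    inputs = list(question_tokens) + masked_reasoning
    labels = [None] * len(question_tokens) + list(reasoning_tokens)
    return inputs, labels


if __name__ == "__main__":
    question = ["Tom", "has", "3", "apples", "and", "buys", "2", "more", "."]
    reasoning = ["3", "+", "2", "=", "5", ".", "The", "answer", "is", "5", "."]
    inputs, labels = mask_reasoning_tokens(
        question, reasoning, mask_prob=0.3, rng=random.Random(0)
    )
    print(inputs)
    print(labels)
```

In a real training pipeline the same idea would operate on token IDs inside the data collator, which is consistent with the abstract's remark that only a few lines of code need to change.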

Masked Thought: Simply Masking Partial Reasoning Steps Can Improve Mathematical Reasoning Learning of Language Models

Learning Math Reasoning from Self-Sampled Correct and Partially-Correct Solutions

Brain-Inspired Two-Stage Approach: Enhancing Mathematical Reasoning by Imitating Human Thought Processes

MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time

Learn Beyond The Answer: Training Language Models with Reflection for Mathematical Reasoning

Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding

Self-Explore: Enhancing Mathematical Reasoning in Language Models with Fine-grained Rewards

On the Impact of Fine-Tuning on Chain-of-Thought Reasoning

Learning to Reason via Self-Iterative Process Feedback for Small Language Models

Distilling Reasoning Ability from Large Language Models with Adaptive Thinking

MinT: Boosting Generalization in Mathematical Reasoning via Multi-View Fine-Tuning

Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning

ReFT: Reasoning with Reinforced Fine-Tuning

Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models

Democratizing Reasoning Ability: Tailored Learning from Large Language Model

Improve Mathematical Reasoning in Language Models by Automated Process Supervision

Teaching-Inspired Integrated Prompting Framework: A Novel Approach for Enhancing Reasoning in Large Language Models

Scaling Relationship on Learning Mathematical Reasoning with Large Language Models

Key-Point-Driven Mathematical Reasoning Distillation of Large Language Model

Evaluating Mathematical Reasoning Beyond Accuracy

Enhancing Mathematical Reasoning in LLMs by Stepwise Correction