Learning to Reason via Self-Iterative Process Feedback for Small Language Models

Kaiyuan Chen, Jin Wang, Xuejie Zhang

2024-12-11

Abstract: Small language models (SLMs) are more efficient, cost-effective, and customizable than large language models (LLMs), though they often underperform in specific areas like reasoning. Past methods for enhancing SLMs' reasoning, such as supervised fine-tuning and distillation, often depend on costly external signals, resulting in SLMs being overly confident with limited supervision signals, thus limiting their abilities. Therefore, this study enables SLMs to learn to reason from self-iterative feedback. By combining odds ratio preference optimization (ORPO), we fine-tune and align SLMs using positive and negative signals generated by themselves. Additionally, we introduce process supervision for rewards in preference alignment by sampling-based inference simulation and process reward models. Compared to Supervised Fine-Tuning (SFT), our method improves the performance of Gemma-2B by 12.43 (Acc) on GSM8K and 3.95 (Pass@1) on MBPP. Furthermore, the proposed method also demonstrated superior out-of-domain generalization capabilities on MMLU_Math and HumanEval.

Computation and Language

What problem does this paper attempt to address?

This paper addresses the weak reasoning ability of small language models (SLMs). Although SLMs are more efficient, cheaper, and easier to customize than large language models (LLMs), they generally perform poorly in specific areas such as reasoning. Previous methods for improving SLM reasoning, such as supervised fine-tuning and knowledge distillation, often rely on expensive external signals; with such limited supervision, SLMs tend to become overconfident, which limits their capabilities. The paper therefore proposes that SLMs learn to reason through self-iterative process feedback (SIPF). Specifically, it uses odds ratio preference optimization (ORPO) to fine-tune and align the SLM on positive and negative signals generated by the model itself, and it introduces process supervision into the preference rewards through sampling-based inference simulation and process reward models. Experimental results show that, compared with supervised fine-tuning (SFT), the method improves the accuracy of Gemma-2B by 12.43 on GSM8K and its Pass@1 by 3.95 on MBPP. The proposed method also shows superior out-of-domain generalization on MMLU_Math and HumanEval.
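The two ingredients named in the summary, process rewards from sampled continuations and an ORPO-style objective on self-generated preference pairs, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The helper names (`step_reward`, `pick_pair`, `orpo_loss`), the mean-reward ranking rule, and the weight `lam=0.1` are assumptions made for illustration. The loss follows the published ORPO form: a supervised term on the chosen response plus a weighted negative log-sigmoid of the log odds ratio between chosen and rejected responses.

```python
import math
from typing import List, Sequence, Tuple


def step_reward(rollout_correct: Sequence[bool]) -> float:
    """Assumed process reward: the fraction of sampled continuations
    from a given reasoning step that reach the correct final answer."""
    return sum(rollout_correct) / len(rollout_correct)


def pick_pair(candidates: List[Tuple[str, List[float]]]) -> Tuple[str, str]:
    """Return (chosen, rejected) as the candidates with the highest and
    lowest mean step reward. The ranking rule is an assumption."""
    ranked = sorted(candidates, key=lambda c: sum(c[1]) / len(c[1]))
    return ranked[-1][0], ranked[0][0]


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def log_odds(avg_logp: float) -> float:
    """log(p / (1 - p)) where p = exp(avg_logp) is the length-normalized
    sequence likelihood. Requires avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(avg_logp_chosen: float, avg_logp_rejected: float,
              lam: float = 0.1) -> float:
    """ORPO-style loss for one pair: SFT term on the chosen response
    plus lam times the odds-ratio penalty. lam is a hypothetical value."""
    log_ratio = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    sft_term = -avg_logp_chosen
    odds_ratio_term = -log_sigmoid(log_ratio)
    return sft_term + lam * odds_ratio_term


# Usage: score two self-generated solutions, then compute the pair loss.
chosen, rejected = pick_pair([
    ("solution A", [step_reward([True, True, False, True]), 1.0]),
    ("solution B", [step_reward([False, False, True, False]), 0.0]),
])
loss = orpo_loss(avg_logp_chosen=-0.2, avg_logp_rejected=-1.5)
```

In a real training loop the average log-probabilities would come from the model's token-level outputs, and the loss would be computed on tensors so gradients can flow; plain floats are used here only to keep the arithmetic visible.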

Self-Explore: Enhancing Mathematical Reasoning in Language Models with Fine-grained Rewards

Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers

Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding

Improve Mathematical Reasoning in Language Models by Automated Process Supervision

Optimizing Language Model's Reasoning Abilities with Weak Supervision

Large Language Models Can Self-Improve in Long-context Reasoning

GLoRe: When, Where, and How to Improve LLM Reasoning via Global and Local Refinements

Small Language Models Need Strong Verifiers to Self-Correct Reasoning

Enhancing Language Model Reasoning via Weighted Reasoning in Self-Consistency

Self-Correction is More than Refinement: A Learning Framework for Visual and Language Reasoning Tasks

Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning

Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing

Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning

SMART: Self-learning Meta-strategy Agent for Reasoning Tasks

Recursive Introspection: Teaching Language Model Agents How to Self-Improve

Improving Language Model Reasoning with Self-motivated Learning

Rational Metareasoning for Large Language Models

Self-Refine Instruction-Tuning for Aligning Reasoning in Language Models

A Systematic Analysis of Large Language Models as Soft Reasoners: The Case of Syllogistic Inferences

Masked Thought: Simply Masking Partial Reasoning Steps Can Improve Mathematical Reasoning Learning of Language Models