Learning to Reason via Self-Iterative Process Feedback for Small Language Models

Kaiyuan Chen, Jin Wang, Xuejie Zhang
2024-12-11
Abstract: Small language models (SLMs) are more efficient, cost-effective, and customizable than large language models (LLMs), though they often underperform in specific areas like reasoning. Past methods for enhancing SLMs' reasoning, such as supervised fine-tuning and distillation, often depend on costly external signals, resulting in SLMs being overly confident with limited supervision signals, thus limiting their abilities. Therefore, this study enables SLMs to learn to reason from self-iterative feedback. By combining odds ratio preference optimization (ORPO), we fine-tune and align SLMs using positive and negative signals generated by themselves. Additionally, we introduce process supervision for rewards in preference alignment by sampling-based inference simulation and process reward models. Compared to Supervised Fine-Tuning (SFT), our method improves the performance of Gemma-2B by 12.43 (Acc) on GSM8K and 3.95 (Pass@1) on MBPP. Furthermore, the proposed method also demonstrated superior out-of-domain generalization capabilities on MMLU_Math and HumanEval.
Computation and Language
What problem does this paper attempt to address?
This paper addresses the weak reasoning ability of small language models (SLMs). Although SLMs are more efficient, cheaper, and easier to customize than large language models (LLMs), they generally perform poorly in specific areas such as reasoning. Previous methods for improving SLM reasoning, such as supervised fine-tuning and knowledge distillation, often rely on expensive external signals; with only limited supervision signals, SLMs become over-confident, which limits their capabilities. The paper therefore proposes a method by which SLMs learn to reason through self-iterative process feedback (SIPF). Specifically, it uses odds ratio preference optimization (ORPO) to fine-tune and align the SLM with positive and negative signals generated by the model itself, and it introduces process supervision for the rewards through sampling-based inference simulation and process reward models. Experimental results show that, compared with supervised fine-tuning (SFT), the method improves Gemma-2B by 12.43 in accuracy on GSM8K and by 3.95 in Pass@1 on MBPP. The proposed method also shows superior out-of-domain generalization on MMLU_Math and HumanEval.
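As a rough illustration of the ORPO objective mentioned above, the following Python sketch computes the loss for a single preference pair. It follows the general ORPO formulation (a supervised negative log-likelihood term on the preferred response plus a weighted log-odds-ratio term), not this paper's training code. The function names, the use of length-averaged log-probabilities as inputs, and the default weight `lam=0.1` are assumptions made for the example.

```python
import math


def log_odds(avg_logprob: float) -> float:
    """Return log(p / (1 - p)) for p = exp(avg_logprob).

    avg_logprob is assumed to be the length-averaged token log-probability
    of a response, so it must be strictly negative (p < 1).
    """
    p = math.exp(avg_logprob)
    return avg_logprob - math.log1p(-p)


def orpo_loss(chosen_avg_logprob: float,
              rejected_avg_logprob: float,
              lam: float = 0.1) -> float:
    """ORPO-style loss for one (chosen, rejected) pair.

    loss = NLL(chosen) + lam * -log(sigmoid(log_odds(chosen) - log_odds(rejected)))
    lam is a hypothetical default, not a value taken from the paper.
    """
    # Supervised term: negative log-likelihood of the preferred response.
    nll = -chosen_avg_logprob
    # Log odds ratio between the preferred and the dispreferred response.
    log_odds_ratio = log_odds(chosen_avg_logprob) - log_odds(rejected_avg_logprob)
    # -log(sigmoid(z)) written as log(1 + exp(-z)).
    odds_ratio_term = math.log1p(math.exp(-log_odds_ratio))
    return nll + lam * odds_ratio_term


if __name__ == "__main__":
    # Made-up likelihoods for a self-generated positive and negative sample.
    good_pair = orpo_loss(math.log(0.8), math.log(0.2))
    bad_pair = orpo_loss(math.log(0.2), math.log(0.8))
    print(f"chosen more likely:   {good_pair:.4f}")
    print(f"rejected more likely: {bad_pair:.4f}")
```

In the self-iterative setting described above, the chosen and rejected responses would both be sampled from the SLM itself and ranked by the process-level reward; how that ranking is produced is not specified here, so the sketch takes the pair as given.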