PaD: Program-aided Distillation Can Teach Small Models Reasoning Better Than Chain-of-thought Fine-tuning

Xuekai Zhu, Biqing Qi, Kaiyan Zhang, Xinwei Long, Zhouhan Lin, Bowen Zhou
DOI: https://doi.org/10.18653/v1/2024.naacl-long.142
2024-01-01
Abstract: While large language models (LLMs) excel in various natural language processing tasks, their huge size and the inaccessibility of parameters present challenges for practical deployment. Previous studies try to distill task-specific ability from LLMs to smaller models, using data synthesis and chain-of-thought (CoT) fine-tuning. However, synthetic CoT data often contains faulty reasoning, which deteriorates the quality of distillation, especially in reasoning capabilities. In this work, we propose Program-aided Distillation (PaD), which introduces reasoning programs to suppress the errors in distilled data, and thus achieves better distillation quality for reasoning tasks. In PaD, we utilize the reasoning program to substitute the CoT, allowing automated error checking of synthetic data. Further, through error injecting and further training, the small distilling model could iteratively self-refine the reasoning. Moreover, we conduct a step-wise beam search by step-by-step verifying to acquire more exact reasoning chains. We evaluate PaD on arithmetic reasoning, symbolic reasoning, and general ability. Experimental results demonstrate that smaller models using PaD can not only outperform certain LLMs (e.g., LLaMA-1 13B) but also achieve strong improvement over baselines with a significantly smaller scale of parameters and data. The source code is publicly available at https://github.com/Xuekai-Zhu/pad.
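
The automated error checking mentioned in the abstract can be illustrated with a minimal Python sketch. It is not taken from the PaD code base: it assumes each synthetic sample stores its reasoning program as Python source that assigns its result to a variable named `answer`, and the names `run_program`, `filter_samples`, the sample fields, and the tolerance are all hypothetical. A program is kept only if it executes without error and its result matches the reference answer.

```python
from typing import Optional


def run_program(source: str) -> Optional[object]:
    """Execute a candidate reasoning program and return its `answer` variable.

    Returns None if the program raises (including syntax errors) or never
    assigns `answer`. Assumption: the source is trusted or run in a sandbox.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)
    except Exception:
        return None
    return namespace.get("answer")


def filter_samples(samples, tol=1e-6):
    """Keep only samples whose program output matches the gold answer."""
    kept = []
    for sample in samples:
        result = run_program(sample["program"])
        if isinstance(result, (int, float)) and abs(result - sample["gold"]) <= tol:
            kept.append(sample)
    return kept


# Hypothetical synthetic data: one correct program, one with faulty
# reasoning, and one that does not even parse.
samples = [
    {"program": "bags = 3\nper_bag = 4\neaten = 2\nanswer = bags * per_bag - eaten",
     "gold": 10},
    {"program": "bags = 3\nper_bag = 4\neaten = 2\nanswer = bags + per_bag - eaten",
     "gold": 10},
    {"program": "answer = 3 * (4 - ", "gold": 10},
]

kept = filter_samples(samples)
print(len(kept))
```

Unlike free-form CoT text, a program either runs and yields a checkable value or it does not, which is what makes this kind of filtering mechanical.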