Teaching Language Models to Self-Improve through Interactive Demonstrations

Xiao Yu, Baolin Peng, Michel Galley, Jianfeng Gao, Zhou Yu
2024-04-01
Abstract: The self-improving ability of large language models (LLMs), enabled by prompting them to analyze and revise their own outputs, has garnered significant interest in recent research. However, this ability has been shown to be absent and difficult to learn for smaller models, thus widening the performance gap between state-of-the-art LLMs and more cost-effective and faster ones. To reduce this gap, we introduce TriPosT, a training algorithm that endows smaller models with such self-improvement ability, and show that our approach can improve a LLaMA-7b's performance on math and reasoning tasks by up to 7.13%. In contrast to prior work, we achieve this by using the smaller model to interact with LLMs to collect feedback and improvements on its own generations. We then replay this experience to train the small model. Our experiments on four math and reasoning datasets show that the interactive experience of learning from and correcting its own mistakes is crucial for small models to improve their performance.
Computer Science
What problem does this paper attempt to address?
The paper asks how smaller language models can acquire the ability to self-improve, particularly on math and reasoning tasks. Large language models (LLMs) can improve their answers when prompted to analyze and revise their own outputs, but smaller models lack this ability and struggle to learn it, which widens the performance gap between the two. To narrow that gap, the paper introduces a training algorithm named TriPosT. The small model generates its own attempts, interacts with LLMs to collect feedback and improvements on those attempts, and is then trained by replaying this experience. Experiments on four math and reasoning datasets show that TriPosT improves a LLaMA-7b model's performance by up to 7.13%. The paper also explores different configurations of TriPosT and their impact on the model's self-improvement ability and task performance.
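The collect-then-replay loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the names `Trajectory`, `collect_experience`, and `to_training_text`, the callables standing in for the small model and the LLM, the text template, and the choice to keep only trajectories whose improved answer is correct are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    """One piece of interactive experience (hypothetical structure)."""
    question: str
    attempt: str      # the small model's own generation
    feedback: str     # critique returned by the larger LLM
    improvement: str  # revised answer returned by the larger LLM


def collect_experience(
    questions: List[str],
    small_model: Callable[[str], str],
    llm_feedback: Callable[[str, str], str],
    llm_improve: Callable[[str, str, str], str],
    is_correct: Callable[[str, str], bool],
) -> List[Trajectory]:
    """Let the small model attempt each question, then ask the LLM to
    critique and revise the attempts that are wrong."""
    experience = []
    for q in questions:
        attempt = small_model(q)
        if is_correct(q, attempt):
            continue  # nothing to correct for this question
        feedback = llm_feedback(q, attempt)
        improvement = llm_improve(q, attempt, feedback)
        # Assumption: keep only trajectories that end in a correct answer.
        if is_correct(q, improvement):
            experience.append(Trajectory(q, attempt, feedback, improvement))
    return experience


def to_training_text(t: Trajectory) -> str:
    """Flatten a trajectory into one sequence to replay during fine-tuning."""
    return (
        f"Q: {t.question}\nAttempt: {t.attempt}\n"
        f"Feedback: {t.feedback}\nUpdated answer: {t.improvement}"
    )


# Toy stand-ins so the sketch runs end to end.
answers = {"2+2": "4", "3+5": "8"}
experience = collect_experience(
    questions=list(answers),
    small_model=lambda q: {"2+2": "4", "3+5": "7"}[q],
    llm_feedback=lambda q, a: "The sum is wrong; recompute it.",
    llm_improve=lambda q, a, fb: answers[q],
    is_correct=lambda q, a: answers[q] == a,
)
training_texts = [to_training_text(t) for t in experience]
```

In this toy run only the question the small model gets wrong produces a trajectory. The point of the design is that the training data starts from the small model's own mistakes rather than from errors an LLM would make.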