ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent

Renat Aksitov, Sobhan Miryoosefi, Zonglin Li, Daliang Li, Sheila Babayan, Kavya Kopparapu, Zachary Fisher, Ruiqi Guo, Sushant Prakash, Pranesh Srinivasan, Manzil Zaheer, Felix Yu, Sanjiv Kumar
2023-12-16
Abstract: Answering complex natural language questions often necessitates multi-step reasoning and integrating external information. Several systems have combined knowledge retrieval with a large language model (LLM) to answer such questions. These systems, however, suffer from various failure cases, and we cannot directly train them end-to-end to fix such failures, as interaction with external knowledge is non-differentiable. To address these deficiencies, we define a ReAct-style LLM agent with the ability to reason and act upon external knowledge. We further refine the agent through a ReST-like method that iteratively trains on previous trajectories, employing growing-batch reinforcement learning with AI feedback for continuous self-improvement and self-distillation. Starting from a prompted large model and after just two iterations of the algorithm, we can produce a fine-tuned small model that achieves comparable performance on challenging compositional question-answering benchmarks with two orders of magnitude fewer parameters.
Computation and Language
What problem does this paper attempt to address?
This paper asks how to improve the performance and robustness of multi-step reasoning large language model (LLM) agents when they answer complex natural language questions. Specifically, the authors focus on two problems:

1. **Multi-step Reasoning and Integration of External Knowledge**: Many complex questions can only be answered through multi-step reasoning combined with external information. Existing systems pair knowledge retrieval with large language models to answer such questions, but they suffer from various failure cases, and because interaction with external knowledge is non-differentiable, they cannot be trained end-to-end to fix those failures.
2. **Lack of High-Quality Multi-step Labeled Data**: For systems that rely on process supervision, high-quality multi-step labeled data is difficult and expensive to obtain, which limits how far the model can be improved.

To address these problems, the authors combine a ReAct-style reasoning mechanism with a ReST-style iterative self-training method:

- Define a ReAct-style agent with self-critique capability that can perform multi-step reasoning and take actions based on external knowledge.
- Apply a ReST-like method that iteratively trains on the agent's previous trajectories, using growing-batch reinforcement learning with AI feedback to achieve continuous self-improvement and self-distillation.
- Starting from a prompted large model, after only two iterations of the algorithm, produce a fine-tuned small model with two orders of magnitude fewer parameters whose performance on challenging compositional question-answering benchmarks is comparable to that of the large model.

Through these methods, the authors aim to improve the ability of multi-step reasoning LLM agents to handle complex questions while reducing the dependence on manually labeled data.
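
The iterative self-training idea summarized above can be illustrated with a minimal Python sketch. This is a toy under stated assumptions, not the authors' implementation: the names `Trajectory`, `rest_style_loop`, `score`, and `fine_tune` are hypothetical, the scorer is a stand-in for AI feedback, and the "fine-tuning" step simply memorizes the trajectories that were kept.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Callable, Dict, List


@dataclass
class Trajectory:
    """One agent run: the question, intermediate steps, and final answer."""
    question: str
    steps: List[str]
    answer: str


Policy = Callable[[str], Trajectory]


def rest_style_loop(
    policy: Policy,
    questions: List[str],
    score: Callable[[Trajectory], float],              # hypothetical stand-in for AI feedback
    fine_tune: Callable[[List[Trajectory]], Policy],   # hypothetical training step
    iterations: int = 2,
    samples_per_question: int = 4,
    threshold: float = 0.5,
) -> Policy:
    """Alternate between sampling trajectories and training on the best ones."""
    for _ in range(iterations):
        # Grow: sample several trajectories per question from the current policy.
        batch = [policy(q) for q in questions for _ in range(samples_per_question)]
        # Improve: keep only trajectories the scorer rates at or above the threshold.
        kept = [t for t in batch if score(t) >= threshold]
        if kept:
            policy = fine_tune(kept)
    return policy


# --- Toy demonstration (all data below is made up for illustration) ---
REFERENCE: Dict[str, str] = {"q1": "a", "q2": "b"}
_guesses = cycle(["a", "b"])  # deterministic "noisy" initial policy


def initial_policy(question: str) -> Trajectory:
    return Trajectory(question, ["search", "reason"], next(_guesses))


def toy_score(t: Trajectory) -> float:
    # Assumption: a reference lookup stands in for an LLM-based ranking of trajectories.
    return 1.0 if REFERENCE.get(t.question) == t.answer else 0.0


def toy_fine_tune(kept: List[Trajectory]) -> Policy:
    # Assumption: "training" just memorizes the answer of each kept trajectory.
    memory = {t.question: t for t in kept}
    return lambda q: memory.get(q, Trajectory(q, [], "unknown"))


final = rest_style_loop(initial_policy, list(REFERENCE), toy_score, toy_fine_tune)
print(final("q1").answer, final("q2").answer)
```

In a real system the policy would be an LLM agent issuing search actions, the scorer would be a model-based ranking of trajectories, and the fine-tuning step could target a smaller model than the one that generated the data, which is how self-distillation would enter the loop.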