Abstract: As language models (LMs) demonstrate their capabilities in various fields, their application to tasks requiring multi-round interactions has become increasingly popular. These tasks usually have complex dynamics, so supervised fine-tuning (SFT) on a limited offline dataset does not yield good performance. However, only a few works attempted to directly train the LMs within interactive decision-making environments. We aim to create an effective approach to fine-tune LMs with online reinforcement learning (RL) in these environments. We propose Reflect-RL, a two-player system to fine-tune an LM using SFT and online RL, where a frozen reflection model (player) assists the policy model (player). To generate data for the warm-up SFT stage, we use negative example generation to enhance the error-correction ability of the reflection model. Furthermore, we designed single-prompt action enumeration and applied curriculum learning to allow the policy model to learn more efficiently. Empirically, we verify that Reflect-RL outperforms SFT and online RL without reflection. Testing results indicate GPT-2 XL 1.56B fine-tuned with Reflect-RL outperforms larger open-source LMs, such as Mistral 7B. The benchmarks, dataset, and code involved in this work are publicly available: https://github.com/zhourunlong/Reflect-RL.

What problem does this paper attempt to address?

The problem this paper addresses is how to fine-tune language models (LMs) effectively with online reinforcement learning (RL) on multi-round interaction tasks, so that they perform better in complex, dynamic environments. Specifically, the paper points out that although language models perform well on a variety of tasks, supervised fine-tuning (SFT) on a limited offline dataset alone does not achieve good performance when a task requires multi-round interaction. In addition, relatively few studies train language models directly in interactive decision-making environments. The paper therefore proposes a method named Reflect-RL, which fine-tunes language models in these environments by combining SFT and online RL.

### Main problems

1. **Performance improvement in multi-round interaction tasks**: Existing supervised fine-tuning methods are not effective on tasks requiring multi-round interactions, especially in environments with complex dynamics.
2. **Application of online reinforcement learning**: Although online reinforcement learning has been successful in other fields, applying it to language models still faces challenges, especially in how to make effective use of the reasoning and reflection capabilities of language models.
3. **Error correction and adaptive capabilities**: Language models may be unable to correct their own errors without external feedback, whereas online reinforcement learning provides dynamic adaptation and decision-making.

### Solutions

To address these problems, the paper proposes the Reflect-RL method, which consists of the following key techniques and stages:

1. **Reflection mechanism**:
   - Distill reflection capabilities from GPT-4 through supervised learning and deploy the result as a frozen reflection model that assists the trainable policy model in decision-making.
   - The reflection mechanism accelerates training convergence and improves test performance.
2. **Negative example generation**:
   - Most of the data collected from GPT-4 are positive examples (approximately optimal decisions). To balance the dataset, negative examples are generated by perturbing GPT-4's trajectories and optimal trajectories.
   - Negative examples improve the quality of reflection and ultimately raise the success rate on the benchmarks.
3. **Single-prompt action enumeration**:
   - All valid actions are listed in a single prompt, so the language model only needs to generate one token to select an option.
   - This improves on previous normalization techniques: it produces only valid actions and reduces time complexity.
4. **Task-specific curriculum learning**:
   - Following the idea of curriculum learning, task-specific curricula guide training, for example by providing additional rewards or adjusting the data order, to help the model learn more efficiently.

### Experimental verification

The paper verifies the effectiveness of Reflect-RL on multiple benchmarks, including AutoExplore, DangerousTaxi and ALFWorld. The experimental results show that GPT-2 XL fine-tuned with Reflect-RL outperforms models fine-tuned with SFT alone or with online RL without reflection on multiple tasks, and even surpasses larger open-source language models such as Mistral 7B.

### Summary

Reflect-RL improves the performance of language models on multi-round interaction tasks by combining a reflection mechanism, negative example generation, single-prompt action enumeration and task-specific curriculum learning, demonstrating the potential of online reinforcement learning for fine-tuning language models.
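
### Illustrative sketch: single-prompt action enumeration

The single-prompt action enumeration idea described above can be illustrated with a small Python sketch. This is not the authors' implementation. The function names (`build_enumeration_prompt`, `option_distribution`, `select_action`), the prompt wording, and the dictionary of next-token logits are all hypothetical stand-ins; in a real system the logits would come from the policy LM's next-token output. The sketch only shows the core idea: list every valid action under a one-character label, then restrict the choice to those labels so that a single generated token always maps to a valid action.

```python
import math
import string


def build_enumeration_prompt(observation, actions):
    """Return (prompt, labels): every valid action is listed under a one-letter label."""
    if not actions:
        raise ValueError("at least one action is required")
    if len(actions) > len(string.ascii_uppercase):
        raise ValueError("this sketch supports at most 26 actions")
    labels = list(string.ascii_uppercase[:len(actions)])
    lines = [observation, "Valid actions:"]
    lines += [f"{label}. {action}" for label, action in zip(labels, actions)]
    lines.append("Answer with one letter:")
    return "\n".join(lines), labels


def option_distribution(token_logits, labels):
    """Softmax restricted to the option labels; all other tokens are ignored."""
    scores = [token_logits[label] for label in labels]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return {label: value / total for label, value in zip(labels, exps)}


def select_action(token_logits, labels, actions):
    """Greedy choice: map the most probable label back to its action."""
    probs = option_distribution(token_logits, labels)
    best_label = max(probs, key=probs.get)
    return actions[labels.index(best_label)]


if __name__ == "__main__":
    actions = ["go north", "open fridge", "take apple"]
    prompt, labels = build_enumeration_prompt("You are in a kitchen.", actions)
    print(prompt)
    # Hypothetical next-token logits; "the" is not an option label and is ignored.
    fake_logits = {"A": 0.1, "B": 2.0, "C": -1.0, "the": 5.0}
    print(select_action(fake_logits, labels, actions))
```

Because the distribution is normalized over the option labels only, an invalid action can never be chosen, and one forward pass over one prompt is enough to score every action. During RL training one would typically sample from the restricted distribution rather than take the greedy choice shown here.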

Reflect-RL: Two-Player Online RL Fine-Tuning for LMs

AdaRefiner: Refining Decisions of Language Models with Adaptive Feedback

Reinforcement Learning from Reflective Feedback (RLRF): Aligning and Improving LLMs via Fine-Grained Self-Reflection

Policy Agnostic RL: Offline RL and Online RL Fine-Tuning of Any Class and Backbone

RLHF Workflow: From Reward Modeling to Online RLHF

Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF

Optimistic Critic Reconstruction and Constrained Fine-Tuning for General Offline-to-Online RL

Toward Optimal LLM Alignments Using Two-Player Games

Q-SFT: Q-Learning for Language Models via Supervised Fine-Tuning

Just Say What You Want: Only-prompting Self-rewarding Online Preference Optimization

Shattering the Agent-Environment Interface for Fine-Tuning Inclusive Language Models

Fine-tuning Language Models with Generative Adversarial Feedback

Online Preference-based Reinforcement Learning with Self-augmented Feedback from Large Language Model

Reflexion: Language Agents with Verbal Reinforcement Learning

Online Learning from Strategic Human Feedback in LLM Fine-Tuning

MetaReflection: Learning Instructions for Language Agents using Past Reflections

An Emulator for Fine-Tuning Large Language Models using Small Language Models

Balancing Enhancement, Harmlessness, and General Capabilities: Enhancing Conversational LLMs with Direct RLHF

Teaching Large Language Models to Reason with Reinforcement Learning

SAC-GLAM: Improving Online RL for LLM agents with Soft Actor-Critic and Hindsight Relabeling

RLSF: Reinforcement Learning via Symbolic Feedback