Abstract: Large Language Models (LLMs) have achieved remarkable success at tasks like summarization that involve a single turn of interaction. However, they can still struggle with multi-turn tasks like dialogue that require long-term planning. Previous works on multi-turn dialogue extend single-turn reinforcement learning from human feedback (RLHF) methods to the multi-turn setting by treating all prior dialogue turns as a long context. Such approaches suffer from covariate shift: the conversations in the training set have previous turns generated by some reference policy, which means that low training error may not necessarily correspond to good performance when the learner is actually in the conversation loop. In response, we introduce REgressing the RELative FUture (REFUEL), an efficient policy optimization approach designed to address multi-turn RLHF in LLMs. REFUEL employs a single model to estimate $Q$-values and trains on self-generated data, addressing the covariate shift issue. REFUEL frames the multi-turn RLHF problem as a sequence of regression tasks on iteratively collected datasets, enabling ease of implementation. Theoretically, we prove that REFUEL can match the performance of any policy covered by the training set. Empirically, we evaluate our algorithm by using Llama-3.1-70B-it to simulate a user in conversation with our model. REFUEL consistently outperforms state-of-the-art methods such as DPO and REBEL across various settings. Furthermore, despite having only 8 billion parameters, Llama-3-8B-it fine-tuned with REFUEL outperforms Llama-3.1-70B-it on long multi-turn dialogues. Implementation of REFUEL can be found at https://github.com/ZhaolinGao/REFUEL/, and models trained by REFUEL can be found at https://huggingface.co/Cornell-AGI.
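The abstract describes REFUEL as a sequence of regression tasks but does not state the loss. The following is a minimal sketch of one plausible form, assuming a pairwise squared-error objective in the style of REBEL: the gap in policy log-ratios between two continuations from a shared dialogue prefix is regressed onto the gap in their observed future rewards. The function name, its arguments, and the scale parameter `eta` are all hypothetical and are not taken from the released implementation.

```python
def relative_future_loss(logp_new_a, logp_old_a, logp_new_b, logp_old_b,
                         reward_a, reward_b, eta=1.0):
    """Squared error between a scaled log-ratio gap and a reward gap.

    Assumed setup (not confirmed by the abstract): continuations a and b
    start from the same dialogue prefix, logp_old_* are log-probabilities
    under the policy that generated the data, logp_new_* are those of the
    policy being trained, and reward_* are the rewards each continuation
    eventually received.
    """
    # How much more the new policy favors a over b, relative to the old policy.
    log_ratio_gap = (logp_new_a - logp_old_a) - (logp_new_b - logp_old_b)
    # How much better a turned out than b.
    reward_gap = reward_a - reward_b
    return (log_ratio_gap / eta - reward_gap) ** 2


def mean_loss(batch, eta=1.0):
    """Average the pairwise loss over a list of 6-tuples of floats."""
    return sum(relative_future_loss(*pair, eta=eta) for pair in batch) / len(batch)
```

Under these assumptions, the loss is zero exactly when the change in relative log-probability matches the reward difference scaled by `eta`, so minimizing it shifts probability toward the continuation with the better future.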

Glue - Enhancing Compatibility and Flexibility of Reinforcement Learning Platforms by Decoupling Algorithms and Environments

Robot Simulation and Reinforcement Learning Training Platform Based on Distributed Architecture

A Framework for Mapping DRL Algorithms with Prioritized Replay Buffer onto Heterogeneous Platforms

An Adaptive Placement and Parallelism Framework for Accelerating RLHF Training

Efficient Reinforcement Learning via Decoupling Exploration and Utilization

ELF: An Extensive, Lightweight and Flexible Research Platform for Real-time Strategy Games

CaiRL: A High-Performance Reinforcement Learning Environment Toolkit

HybridFlow: A Flexible and Efficient RLHF Framework

EnvPool: A Highly Parallel Reinforcement Learning Environment Execution Engine

Model-GLUE: Democratized LLM Scaling for A Large Model Zoo in the Wild

An Offline-Transfer-Online Framework for Cloud-Edge Collaborative Distributed Reinforcement Learning

Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF

Cleanba: A Reproducible and Efficient Distributed Reinforcement Learning Platform

Eden: A Unified Environment Framework for Booming Reinforcement Learning Algorithms

SERL: A Software Suite for Sample-Efficient Robotic Reinforcement Learning

Decoupled Alignment for Robust Plug-and-Play Adaptation

Scilab-RL: A software framework for efficient reinforcement learning and cognitive modeling research

PUZZLE: Efficiently Aligning Large Language Models Through Light-Weight Context Switch

Efficient Parallel Reinforcement Learning Framework using the Reactor Model

Co-Adaptation of Algorithmic and Implementational Innovations in Inference-based Deep Reinforcement Learning

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding