Abstract: Value-based reinforcement learning (RL) can in principle learn effective policies for a wide range of multi-turn problems, from games to dialogue to robotic control, including via offline RL from static, previously collected datasets. However, despite the widespread use of policy gradient methods to train large language models for single-turn tasks (e.g., question answering), value-based methods for multi-turn RL in an off-policy or offline setting have proven particularly challenging to scale to the setting of large language models. This setting requires effectively leveraging pretraining, scaling to large architectures with billions of parameters, and training on large datasets, all of which represent major challenges for current value-based RL methods. In this work, we propose a novel offline RL algorithm that addresses these drawbacks, casting Q-learning as a modified supervised fine-tuning (SFT) problem where the probabilities of tokens directly translate to Q-values. In this way we obtain an algorithm that smoothly transitions from maximizing the likelihood of the data during pretraining to learning a near-optimal Q-function during finetuning. Our algorithm has strong theoretical foundations, enjoying performance bounds similar to state-of-the-art Q-learning methods, while in practice utilizing an objective that closely resembles SFT. Because of this, our approach can enjoy the full benefits of the pretraining of language models, without the need to reinitialize any weights before RL finetuning, and without the need to initialize new heads for predicting values or advantages. Empirically, we evaluate our method on both pretrained LLMs and VLMs, on a variety of tasks including both natural language dialogue and robotic manipulation and navigation from images.

What problem does this paper attempt to address?

The paper asks how to apply value-based reinforcement learning (RL), and Q-learning in particular, to large language models (LLMs) and vision-language models (VLMs) on multi-turn tasks, where existing methods scale poorly. Specifically:

1. **Challenges in multi-turn tasks**: Policy gradient methods are widely used to train large language models on single-turn tasks such as question answering. Value-based methods such as Q-learning, by contrast, have proven hard to scale to large language models on multi-turn tasks such as dialogue or robotic control. The difficulties include making effective use of pretraining, scaling to architectures with billions of parameters, and training on large datasets.
2. **Limitations of existing methods**: Existing offline RL methods perform poorly with large language models, mainly because they regress the value function, which yields an unstable learning objective in large networks. These methods also usually reinitialize weights or add new heads to predict values or advantages during fine-tuning, so they cannot fully exploit the prior knowledge in the pretrained model.
3. **Proposed solution**: The paper proposes a new offline RL algorithm, Q-SFT (Q-Learning via Supervised Fine-Tuning), which casts Q-learning as a modified supervised fine-tuning problem in which the model's token probabilities translate directly to Q-values. The algorithm can therefore transition smoothly from maximizing data likelihood during pretraining to learning a near-optimal Q-function during fine-tuning, without reinitializing weights or adding new prediction heads.
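
The idea in item 3 can be made concrete with a small sketch. The code below is an illustration of the general pattern only, not the paper's exact objective: it builds a soft target distribution in which the probability assigned to the logged action token is a Bellman-style backup, then trains with ordinary cross-entropy against that target. The helper names (`soft_target`, `cross_entropy`), the uniform spreading of the leftover mass, the clipping to [0, 1], and the assumption that rewards lie in [0, 1] are all hypothetical choices made for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target(action, reward, next_probs, vocab_size, gamma=0.99, done=False):
    """Hypothetical target: mass on `action` is a Bellman-style backup.

    Assumes rewards in [0, 1] so the backup can be read as a probability.
    The leftover mass is spread uniformly over the other tokens (an
    assumption of this sketch, not something stated in the source).
    """
    bootstrap = 0.0 if done else gamma * max(next_probs)
    q_target = min(max(reward + bootstrap, 0.0), 1.0)  # clip into [0, 1]
    rest = (1.0 - q_target) / (vocab_size - 1)
    return [q_target if i == action else rest for i in range(vocab_size)]


def cross_entropy(logits, target):
    """Cross-entropy between a soft target and the model's token distribution."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs))


# Toy usage: a 3-token vocabulary, one logged transition.
next_probs = softmax([0.1, 1.2, -0.3])  # stand-in for the model's output at s'
target = soft_target(action=1, reward=0.0, next_probs=next_probs, vocab_size=3)
loss = cross_entropy([0.0, 0.5, -0.5], target)
print(target, loss)
```

Because the loss is a cross-entropy over the existing vocabulary, a sketch like this needs no separate value head, which is the property the summary emphasizes.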

### Specific contributions

- **Theoretical basis**: The paper provides a theoretical analysis showing that the proposed algorithm enjoys performance bounds similar to those of state-of-the-art Q-learning methods, while in practice using an objective that closely resembles supervised fine-tuning.
- **Practical effects**: The paper evaluates Q-SFT on pretrained LLMs and VLMs across a variety of tasks, including natural language dialogue, image-based navigation, and robotic manipulation. According to the summary, Q-SFT makes better use of the pretrained model on multi-turn tasks and, being an offline method, requires no additional interaction data.

### Summary

The paper's main goal is to make value-based RL practical for multi-turn tasks with large language models and vision-language models. It does so by proposing the Q-SFT algorithm, with the aim of improving these models' performance on complex sequential decision-making tasks.

Q-SFT: Q-Learning for Language Models via Supervised Fine-Tuning
