Abstract: A centerpiece of the ever-popular reinforcement learning from human feedback (RLHF) approach to fine-tuning autoregressive language models is the explicit training of a reward model to emulate human feedback, distinct from the language model itself. This reward model is then coupled with policy-gradient methods to dramatically improve the alignment between language model outputs and desired responses. In this work, we adopt a novel perspective wherein a pre-trained language model is itself simultaneously a policy, reward function, and transition function. An immediate consequence of this is that reward learning and language model fine-tuning can be performed jointly and directly, without requiring any further downstream policy optimization. While this perspective does indeed break the traditional agent-environment interface, we nevertheless maintain that there can be enormous statistical benefits afforded by bringing to bear traditional algorithmic concepts from reinforcement learning. Our experiments demonstrate one concrete instance of this through efficient exploration based on the representation and resolution of epistemic uncertainty. In order to illustrate these ideas in a transparent manner, we restrict attention to a simple didactic data generating process and leave for future work extension to systems of practical scale.

What problem does this paper attempt to address?

This paper addresses a limitation of current Reinforcement Learning from Human Feedback (RLHF) methods for fine-tuning large language models (LLMs). Traditional RLHF explicitly trains a reward model to emulate human feedback, then combines it with policy-gradient methods to improve the alignment between language model outputs and desired responses. The paper argues that this produces agglomerative models, which tend to concentrate all probability mass on a single "best" response preferred by the majority of the population and so ignore diverse user preferences.

To address this, the paper proposes a new perspective in which the pre-trained language model itself serves simultaneously as a policy, a reward function, and an environment simulator. Based on this perspective, it introduces a new fine-tuning algorithm, Inclusive Learning From Human Feedback (ILHF), with two main advantages:

1. **Computational efficiency**: ILHF does not require a further policy-gradient step during fine-tuning, which simplifies the computational flow.
2. **Statistical efficiency**: The model produced by ILHF is inclusive and can converge to a response distribution that reflects the preferences of the entire user group, rather than only those of the majority.

Through experiments on simple examples, the paper shows that ILHF can learn an inclusive response distribution, whereas traditional RLHF methods tend to produce agglomerative models. The paper also explores how exploration strategies from reinforcement learning can improve the statistical efficiency of ILHF, with particular attention to their potential for large-scale problems.

In conclusion, the paper proposes a new fine-tuning method that breaks the traditional agent-environment interface in order to capture and reflect the diverse preferences of the user group more effectively.
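The contrast between agglomerative and inclusive models can be made concrete with a toy two-response example. The sketch below is not the paper's ILHF implementation; it is a minimal illustration under stated assumptions. It assumes a population in which a fraction `p_prefers_a` of users prefer response A over response B, and it assumes that "the language model is itself the reward function" can be read as using the policy's own probabilities as the pairwise choice model, P(A chosen) = π(A) / (π(A) + π(B)). The function names are hypothetical. The RLHF baseline here omits the KL regularization that practical pipelines usually add, which would soften but not remove the pull toward the majority response.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward_maximizing_policy(p_prefers_a):
    """Toy RLHF baseline (assumption: no KL regularization).

    A Bradley-Terry reward model fit to the choice data has the closed-form
    reward gap r_A - r_B = log(p / (1 - p)). An unregularized reward
    maximizer then places all probability mass on the higher-reward response.
    A tie (p = 0.5) falls through to response B in this sketch.
    """
    reward_gap = math.log(p_prefers_a / (1.0 - p_prefers_a))
    return [1.0, 0.0] if reward_gap > 0 else [0.0, 1.0]


def fit_policy_as_choice_model(p_prefers_a, steps=2000, lr=0.5):
    """Toy inclusive fit (assumption: the policy doubles as the choice model).

    Maximizes the expected log-likelihood of pairwise choices
        p * log pi(A) + (1 - p) * log pi(B)
    by gradient ascent on the logits. With two responses,
    pi(A) / (pi(A) + pi(B)) reduces to pi(A).
    """
    logits = [0.0, 0.0]
    for _ in range(steps):
        pi = softmax(logits)
        # Gradient of the expected log-likelihood with respect to each logit.
        grad = [p_prefers_a - pi[0], (1.0 - p_prefers_a) - pi[1]]
        logits = [z + lr * g for z, g in zip(logits, grad)]
    return softmax(logits)


if __name__ == "__main__":
    p = 0.6  # hypothetical population split: 60% prefer A, 40% prefer B
    print("reward-maximizing policy:", reward_maximizing_policy(p))
    print("policy fit as choice model:", fit_policy_as_choice_model(p))
```

In this toy setting the reward-maximizing policy always answers with the majority response, while the policy fit directly to the choice data keeps probability mass on both responses in proportion to the population split. Expected choice frequencies are used in place of sampled comparisons so that the result is deterministic.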

Shattering the Agent-Environment Interface for Fine-Tuning Inclusive Language Models

Fine-Tuning Language Models with Reward Learning on Policy

Navigating Noisy Feedback: Enhancing Reinforcement Learning with Error-Prone Language Models

Personalized Language Modeling from Personalized Human Feedback

Optimizing Language Models with Fair and Stable Reward Composition in Reinforcement Learning

Retroformer: Retrospective Large Language Agents with Policy Gradient Optimization

Beyond Sparse Rewards: Enhancing Reinforcement Learning with Language Model Critique in Text Generation

Reinforcement Learning without Human Feedback for Last Mile Fine-Tuning of Large Language Models

Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble

The Accuracy Paradox in RLHF: When Better Reward Models Don't Yield Better Language Models

Fine-Tuning Language Models with Advantage-Induced Policy Alignment

Fine-Grained Human Feedback Gives Better Rewards for Language Model Training

Mental Modeling of Reinforcement Learning Agents by Language Models

Secrets of RLHF in Large Language Models Part II: Reward Modeling

Fine-Tuning Language Models from Human Preferences

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Online Intrinsic Rewards for Decision Making Agents from Large Language Model Feedback

Mutual Enhancement of Large Language and Reinforcement Learning Models through Bi-Directional Feedback Mechanisms: A Case Study

Reinforcement Learning Fine-tuning of Language Models is Biased Towards More Extractable Features