Shattering the Agent-Environment Interface for Fine-Tuning Inclusive Language Models

Wanqiao Xu, Shi Dong, Dilip Arumugam, Benjamin Van Roy
2023-05-19
Abstract: A centerpiece of the ever-popular reinforcement learning from human feedback (RLHF) approach to fine-tuning autoregressive language models is the explicit training of a reward model to emulate human feedback, distinct from the language model itself. This reward model is then coupled with policy-gradient methods to dramatically improve the alignment between language model outputs and desired responses. In this work, we adopt a novel perspective wherein a pre-trained language model is itself simultaneously a policy, reward function, and transition function. An immediate consequence of this is that reward learning and language model fine-tuning can be performed jointly and directly, without requiring any further downstream policy optimization. While this perspective does indeed break the traditional agent-environment interface, we nevertheless maintain that there can be enormous statistical benefits afforded by bringing to bear traditional algorithmic concepts from reinforcement learning. Our experiments demonstrate one concrete instance of this through efficient exploration based on the representation and resolution of epistemic uncertainty. In order to illustrate these ideas in a transparent manner, we restrict attention to a simple didactic data generating process and leave for future work extension to systems of practical scale.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper addresses a limitation of current Reinforcement Learning from Human Feedback (RLHF) methods for fine-tuning large language models (LLMs). Traditional RLHF explicitly trains a reward model to emulate human feedback and combines it with policy-gradient methods to improve the alignment between language model outputs and desired responses. The paper argues that this yields agglomerative models, which tend to concentrate all probability mass on a single "best" response preferred by the majority of the population and so ignore diverse user preferences.

To address this, the paper proposes a new perspective in which the pre-trained language model itself serves simultaneously as a policy, a reward function, and an environment simulator. Building on this perspective, it introduces a new fine-tuning algorithm, Inclusive Learning From Human Feedback (ILHF), with two main advantages:

1. **Computational efficiency**: ILHF does not require a further policy-gradient step during fine-tuning, which simplifies the computation.
2. **Statistical efficiency**: The model produced by ILHF is inclusive: it can converge to a response distribution that reflects the preferences of the entire user population rather than only those of the majority.

In experiments on simple examples, the paper shows that ILHF learns an inclusive response distribution, whereas traditional RLHF methods tend to produce agglomerative models. The paper also explores how exploration strategies from reinforcement learning can improve the statistical efficiency of ILHF, particularly with a view to large-scale problems.
In conclusion, this paper aims to propose a new fine-tuning method by breaking the traditional agent-environment interface to more effectively capture and reflect the diverse preferences of the user group.
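
The contrast between agglomerative and inclusive models described above can be made concrete with a toy calculation. The sketch below is not the paper's ILHF algorithm or its RLHF baseline; it is a minimal illustration under stated assumptions. It assumes two candidate responses, a hypothetical population in which 70% of users prefer the first and 30% the second, and a reward equal to the fraction of users who prefer a response. The function names and the 70/30 split are made up for illustration. One policy is trained by exact gradient ascent on expected reward, the other by exact gradient ascent on the expected log-likelihood of the preferred responses.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def reward_maximizing_policy(pref, steps=5000, lr=0.5):
    """Gradient ascent on expected reward sum_i pi_i * pref_i.

    Assumption: the reward of response i is the fraction of users
    who prefer it. The maximizer puts all mass on the majority response.
    """
    logits = [0.0] * len(pref)
    for _ in range(steps):
        pi = softmax(logits)
        avg = sum(p * r for p, r in zip(pi, pref))
        # d E[r] / d logit_i = pi_i * (r_i - E[r]) for a softmax policy
        logits = [l + lr * p * (r - avg) for l, p, r in zip(logits, pi, pref)]
    return softmax(logits)


def likelihood_matching_policy(pref, steps=5000, lr=0.5):
    """Gradient ascent on expected log-likelihood sum_i pref_i * log pi_i.

    The maximizer reproduces the population's preference proportions.
    """
    logits = [0.0] * len(pref)
    for _ in range(steps):
        pi = softmax(logits)
        # d / d logit_i of sum_j pref_j * log pi_j = pref_i - pi_i
        logits = [l + lr * (r - p) for l, p, r in zip(logits, pi, pref)]
    return softmax(logits)


if __name__ == "__main__":
    population = [0.7, 0.3]  # hypothetical preference split
    print("reward-maximizing:", reward_maximizing_policy(population))
    print("likelihood-matching:", likelihood_matching_policy(population))
```

Under these assumptions, the reward-maximizing policy drifts toward placing nearly all probability on the majority response, while the likelihood-matching policy settles near the 70/30 split. This mirrors the agglomerative versus inclusive distinction only in spirit; the paper's actual setting involves a language model and learned human feedback.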