Personalized Adaptation via In-Context Preference Learning

Allison Lau, Younwoo Choi, Vahid Balazadeh, Keertana Chidambaram, Vasilis Syrgkanis, Rahul G. Krishnan
2024-10-18
Abstract: Reinforcement Learning from Human Feedback (RLHF) is widely used to align Language Models (LMs) with human preferences. However, existing approaches often neglect individual user preferences, leading to suboptimal personalization. We present the Preference Pretrained Transformer (PPT), a novel approach for adaptive personalization using online user feedback. PPT leverages the in-context learning capabilities of transformers to dynamically adapt to individual preferences. Our approach consists of two phases: (1) an offline phase where we train a single policy model using a history-dependent loss function, and (2) an online phase where the model adapts to user preferences through in-context learning. We demonstrate PPT's effectiveness in a contextual bandit setting, showing that it achieves personalized adaptation superior to existing methods while significantly reducing the computational costs. Our results suggest the potential of in-context learning for scalable and efficient personalization in large language models.
Machine Learning, Computation and Language
What problem does this paper attempt to address?
The paper addresses how to capture and respond to individual user preferences in Reinforcement Learning from Human Feedback (RLHF). Existing RLHF methods often overlook differences between users, which leads to poor personalization, especially for users whose preferences differ from the majority. To address this, the paper proposes the Preference Pretrained Transformer (PPT), which adapts to each user through online feedback, aiming to improve personalized adaptation while reducing computational costs.

PPT consists of two stages:

1. **Offline Training Stage**: A single policy model is trained with a history-dependent loss function, so that it learns to predict the preferred answer from the history of responses for each preference criterion.
2. **Online Inference Stage**: Using in-context learning, the model generates two candidate responses for the user to choose between, and it keeps adjusting its responses based on that feedback to better match the user's personal preferences.

The paper validates PPT experimentally in a contextual bandit setting. There, PPT achieves better personalized adaptation than existing methods while significantly reducing computational costs. The results suggest that in-context learning is a feasible and efficient approach to personalizing large language models.
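The online feedback loop described above can be illustrated with a minimal sketch. This is not the PPT model: the transformer policy is replaced by a hypothetical hand-written rule that looks only at the context and the interaction history, and the user is simulated with a made-up utility table. All names (`HIDDEN_UTILITY`, `propose_pair`, `run_online_phase`, and so on) are assumptions introduced for illustration.

```python
import random

# Hypothetical toy setup (not from the paper): 2 contexts, 4 candidate
# responses, and a hidden per-user utility table the policy never reads.
NUM_CONTEXTS = 2
NUM_ACTIONS = 4
HIDDEN_UTILITY = {
    0: [0.1, 0.9, 0.4, 0.2],  # in context 0 this user likes response 1 best
    1: [0.3, 0.2, 0.1, 0.8],  # in context 1 this user likes response 3 best
}


def user_choice(context, a, b):
    """Simulated user: return whichever response has higher hidden utility."""
    utility = HIDDEN_UTILITY[context]
    return a if utility[a] >= utility[b] else b


def incumbent(context, history):
    """Most recently preferred response for this context, or 0 if none yet."""
    for past_context, chosen, _rejected in reversed(history):
        if past_context == context:
            return chosen
    return 0


def propose_pair(context, history, rng):
    """Stand-in for the in-context policy: uses only (context, history).

    It re-proposes the last preferred response and a random challenger.
    """
    first = incumbent(context, history)
    second = rng.choice([a for a in range(NUM_ACTIONS) if a != first])
    return first, second


def run_online_phase(num_rounds, seed=0):
    """Propose two responses per round and append the user's pick to history."""
    rng = random.Random(seed)
    history = []  # list of (context, chosen, rejected) tuples
    for _ in range(num_rounds):
        context = rng.randrange(NUM_CONTEXTS)
        a, b = propose_pair(context, history, rng)
        chosen = user_choice(context, a, b)
        rejected = b if chosen == a else a
        history.append((context, chosen, rejected))
    return history
```

In this sketch no parameters are updated during the online phase; all adaptation comes from the growing `history`, which is the role the context window plays for in-context learning. Because a better challenger always replaces the incumbent, the proposal for each context eventually settles on the simulated user's best response once that response has been drawn as a challenger.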