Abstract: Reinforcement Learning from Human Feedback (RLHF) is widely used to align Language Models (LMs) with human preferences. However, existing approaches often neglect individual user preferences, leading to suboptimal personalization. We present the Preference Pretrained Transformer (PPT), a novel approach for adaptive personalization using online user feedback. PPT leverages the in-context learning capabilities of transformers to dynamically adapt to individual preferences. Our approach consists of two phases: (1) an offline phase where we train a single policy model using a history-dependent loss function, and (2) an online phase where the model adapts to user preferences through in-context learning. We demonstrate PPT's effectiveness in a contextual bandit setting, showing that it achieves personalized adaptation superior to existing methods while significantly reducing the computational costs. Our results suggest the potential of in-context learning for scalable and efficient personalization in large language models.

What problem does this paper attempt to address?

The paper addresses how to capture and respond to individual user preferences in Reinforcement Learning from Human Feedback (RLHF). Existing RLHF methods often overlook differences between users, so personalization falls short, especially for users whose preferences differ from the majority. To solve this, the paper proposes the Preference Pretrained Transformer (PPT), which adapts to each user through online feedback, aiming to improve personalized adaptation while reducing computational costs. PPT has two stages:

1. **Offline Training Stage**: A single policy model is trained with a history-dependent loss function, so that it can predict the preferred answer from the history of responses for each preference criterion.
2. **Online Inference Stage**: The model generates two candidate responses for the user to choose between and, through in-context learning, keeps adjusting its responses based on the user's feedback to better match that user's preferences.

The paper validates PPT experimentally in a contextual bandit setting, where it achieves better personalized adaptation than existing methods while significantly reducing computational costs. This suggests that in-context learning is a feasible and efficient route to personalizing large language models.
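The online interaction described above (propose two responses, observe the user's choice, append it to the history the model conditions on) can be illustrated with a toy loop. This is only a minimal sketch under stated assumptions, not the paper's implementation: responses are stand-in feature vectors, the user is simulated with a hidden linear preference, and `propose_pair` replaces the history-conditioned transformer with a simple heuristic. All function names and parameters here are hypothetical.

```python
import random


def score(pref, x):
    """Linear utility of a response's features under a preference vector."""
    return sum(p * xi for p, xi in zip(pref, x))


def simulated_user_choice(pref, a, b):
    """Hypothetical user: returns (chosen, rejected) under a hidden preference."""
    return (a, b) if score(pref, a) >= score(pref, b) else (b, a)


def propose_pair(history, candidates, rng):
    """Stand-in for the history-conditioned policy (an assumption, not PPT itself).

    With no history it proposes two random candidates. Otherwise it estimates
    the user's preference direction as the sum of (chosen - rejected) feature
    differences and proposes the two candidates ranked highest under it.
    """
    if not history:
        first, second = rng.sample(candidates, 2)
        return first, second
    dim = len(candidates[0])
    estimate = [sum(c[i] - r[i] for c, r in history) for i in range(dim)]
    ranked = sorted(candidates, key=lambda x: score(estimate, x), reverse=True)
    return ranked[0], ranked[1]


def run_online_phase(pref, n_rounds=20, n_candidates=8, dim=3, seed=0):
    """Simulate the online phase: no weights change, only the history grows."""
    rng = random.Random(seed)
    history = []  # list of (chosen, rejected) pairs, i.e. the "context"
    for _ in range(n_rounds):
        # Each round draws a fresh candidate set, standing in for a new prompt.
        candidates = [
            tuple(rng.uniform(-1, 1) for _ in range(dim))
            for _ in range(n_candidates)
        ]
        a, b = propose_pair(history, candidates, rng)
        chosen, rejected = simulated_user_choice(pref, a, b)
        history.append((chosen, rejected))
    return history


if __name__ == "__main__":
    hidden_pref = (1.0, -0.5, 0.2)
    hist = run_online_phase(hidden_pref, n_rounds=10)
    print(f"collected {len(hist)} preference pairs")
```

The point of the sketch is the shape of the loop: adaptation happens by extending the history passed to the policy, with no per-user parameter updates at inference time.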

Personalized Adaptation via In-Context Preference Learning

Personalized Language Modeling from Personalized Human Feedback

Preference Transformer: Modeling Human Preferences using Transformers for RL

Unsupervised Human Preference Learning

Orchestrating LLMs with Different Personalizations

Few-shot In-Context Preference Learning Using Large Language Models

Personalized Soups: Personalized Large Language Model Alignment via Post-hoc Parameter Merging

Democratizing Large Language Models via Personalized Parameter-Efficient Fine-tuning

Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning

Active Preference-based Learning for Multi-dimensional Personalization

FedPC: Federated Learning for Language Generation with Personal and Context Preference Embeddings

Reinforcement Learning for Personalized Dialogue Management

Adaptive Self-Supervised Learning Strategies for Dynamic On-Device LLM Personalization

LLMs + Persona-Plug = Personalized LLMs

Persona-DB: Efficient Large Language Model Personalization for Response Prediction with Collaborative Data Refinement

Contextual Adapters for Personalized Speech Recognition in Neural Transducers

RLPF: Reinforcement Learning from Prediction Feedback for User Summarization with LLMs

Personalization in Human-Robot Interaction through Preference-based Action Representation Learning

Factual and Personalized Recommendations using Language Models and Reinforcement Learning