Abstract: RLHF-aligned LMs have shown unprecedented ability on both benchmarks and long-form text generation, yet they struggle with one foundational task: next-token prediction. As RLHF models become agent models aimed at interacting with humans, they seem to lose their world modeling -- the ability to predict what comes next in arbitrary documents, which is the foundational training objective of the Base LMs that RLHF adapts. Besides empirically demonstrating this trade-off, we propose a potential explanation: to perform coherent long-form generation, RLHF models restrict randomness via implicit blueprints. In particular, RLHF models concentrate probability on sets of anchor spans that co-occur across multiple generations for the same prompt, serving as textual scaffolding but also limiting a model's ability to generate documents that do not include these spans. We study this trade-off on the most effective current agent models, those aligned with RLHF, while exploring why this may remain a fundamental trade-off between models that act and those that predict, even as alignment techniques improve.
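
The anchor-span idea in the abstract can be illustrated with a small sketch. It is not the paper's procedure: it assumes whitespace tokenization, treats word-level n-grams as candidate spans, and calls a span an "anchor" when it appears in at least a chosen fraction of generations for the same prompt. The function names, the `min_fraction` threshold, and the sample texts are all hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the set of distinct n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def anchor_spans(generations, n=3, min_fraction=0.75):
    """Return n-grams that occur in at least min_fraction of the generations.

    Each generation contributes at most one count per n-gram, so the count
    is the number of generations containing that span.
    """
    counts = Counter()
    for text in generations:
        counts.update(ngrams(text.lower().split(), n))
    threshold = min_fraction * len(generations)
    return {" ".join(gram) for gram, count in counts.items() if count >= threshold}


# Made-up generations standing in for several samples from one prompt.
samples = [
    "In conclusion, it is important to note that cats sleep a lot",
    "Overall, it is important to note that dogs need exercise",
    "Finally, it is important to note that birds can fly",
    "Cats are small animals that sleep a lot",
]

spans = anchor_spans(samples, n=3, min_fraction=0.75)
for span in sorted(spans):
    print(span)
```

In this toy data, the spans shared by three of the four samples act as scaffolding, while a span shared by only two samples falls below the threshold.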

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper explores the trade-off between world modeling and agent modeling when language models are fine-tuned with Reinforcement Learning from Human Feedback (RLHF).

**Specifically, the paper attempts to address the following issues:**

1. **Decline in World Modeling Capability**:
   - Language models fine-tuned through RLHF perform well on complex tasks but poorly on basic next-token prediction. This suggests that these models, in becoming agent models (aimed at interacting with humans), lose world modeling capability, i.e., the ability to predict what might come next in any given document.
2. **Concentration of Probability Distribution**:
   - RLHF models limit the randomness of generated text through implicit blueprints, concentrating probability on a specific set of text spans. These anchor spans recur across multiple generated samples for the same prompt, forming a scaffold for generation but also limiting the model's ability to generate documents without these spans.
3. **Self-Predictability and Long Text Generation**:
   - To generate coherent long texts, RLHF models need their future output to be highly predictable from their current state. Agent models therefore need to minimize long-term uncertainty, while world modeling requires preserving the true uncertainty of natural text, which presents a fundamental trade-off.
4. **Fundamental Trade-Off**:
   - The paper explores whether this trade-off is fundamental, i.e., whether it will persist even as alignment techniques improve. The authors argue that self-predictability is an unavoidable aspect of successful agent models because they need to ensure their future behavior is predictable based on their current state.

### Main Findings

- **Performance Comparison**:
  - RLHF models have significantly higher perplexity on language modeling tasks than base models, and even after retraining, they do not reach the performance level of base models.
- **Concentration of Probability Distribution**:
  - RLHF models have a more concentrated probability distribution: during text generation they assign high probability to a few words and almost none to most others.
- **Implicit Blueprints**:
  - RLHF models use implicit blueprints when generating long texts, guiding the generation process through anchor spans to achieve better coherence and predictability.
- **Self-Predictability**:
  - RLHF models exhibit lower perplexity on their own generated text, indicating that they remain within a high-confidence region during generation, which aids long text generation.

### Conclusion

The paper argues that the trade-off between world modeling and agent modeling may be fundamental. Even though new alignment methods may alleviate some issues, the trade-off is expected to persist under fixed capacity. Agent models need to reduce long-term uncertainty to generate coherent long texts, while world modeling requires preserving the true uncertainty of natural text. Therefore, future systems may need to combine world models and agent models rather than relying on a single model to both act and predict.
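
The quantities named in the findings above (perplexity and how concentrated a next-token distribution is) can be made concrete with a small sketch. The distributions and probabilities below are invented for illustration and are not results from the paper. The helper names are hypothetical, and the two toy "models" are stand-ins for a spread-out base-like distribution and a concentrated RLHF-like one.

```python
import math


def perplexity(token_probs):
    """Perplexity: exp of the mean negative log-probability of observed tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


def entropy(dist):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def top_k_mass(dist, k):
    """Total probability assigned to the k most likely tokens."""
    return sum(sorted(dist, reverse=True)[:k])


# Toy next-token distributions over a 4-word vocabulary (assumed values).
spread_dist = [0.4, 0.3, 0.2, 0.1]            # base-like: mass spread out
concentrated_dist = [0.97, 0.01, 0.01, 0.01]  # RLHF-like: mass on one token

# Probabilities each toy model assigns to the tokens of one "natural" text
# in which the second and third tokens are not the model's top choice.
spread_on_text = [0.4, 0.3, 0.2, 0.4]
concentrated_on_text = [0.97, 0.01, 0.01, 0.97]

print("entropy:", entropy(spread_dist), entropy(concentrated_dist))
print("top-1 mass:", top_k_mass(spread_dist, 1), top_k_mass(concentrated_dist, 1))
print("perplexity:", perplexity(spread_on_text), perplexity(concentrated_on_text))
```

In this toy setup the concentrated distribution has lower entropy and higher top-1 mass, yet higher perplexity on text that strays from its preferred tokens. It is very confident on the tokens it favors, which is the flavor of the self-predictability finding, and it pays heavily on everything else.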

Predicting vs. Acting: A Trade-off Between World Modeling & Agent Modeling

Stabilizing RLHF through Advantage Model and Selective Rehearsal

Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer

Confronting Reward Model Overoptimization with Constrained RLHF

Would I Lie To You? Inference Time Alignment of Language Models using Direct Preference Heads

The Accuracy Paradox in RLHF: When Better Reward Models Don't Yield Better Language Models

Mental Modeling of Reinforcement Learning Agents by Language Models

Shattering the Agent-Environment Interface for Fine-Tuning Inclusive Language Models

Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint

Fine-tuning Language Models with Generative Adversarial Feedback

Adaptive Dense Reward: Understanding the Gap Between Action and Reward Space in Alignment

Generative Reward Models

On the Algorithmic Bias of Aligning Large Language Models with RLHF: Preference Collapse and Matching Regularization

A Critical Look At Tokenwise Reward-Guided Text Generation

Secrets of RLHF in Large Language Models Part I: PPO

Reward-Robust RLHF in LLMs

Proxy-RLHF: Decoupling Generation and Alignment in Large Language Model with Proxy

ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL

Fine-tuning Language Models with Generative Adversarial Reward Modelling

Personalized Language Modeling from Personalized Human Feedback