Abstract: The majority of language model training builds on imitation learning. It covers pretraining, supervised fine-tuning, and affects the starting conditions for reinforcement learning from human feedback (RLHF). The simplicity and scalability of maximum likelihood estimation (MLE) for next token prediction led to its role as the predominant paradigm. However, the broader field of imitation learning can more effectively utilize the sequential structure underlying autoregressive generation. We focus on investigating the inverse reinforcement learning (IRL) perspective on imitation, extracting rewards and directly optimizing sequences instead of individual token likelihoods, and evaluate its benefits for fine-tuning large language models. We provide a new angle, reformulating inverse soft-Q-learning as a temporal difference regularized extension of MLE. This creates a principled connection between MLE and IRL and allows trading off added complexity with increased performance and diversity of generations in the supervised fine-tuning (SFT) setting. We find clear advantages for IRL-based imitation, in particular for retaining diversity while maximizing task performance, rendering IRL a strong alternative on fixed SFT datasets even without online data generation. Our analysis of IRL-extracted reward functions further indicates benefits for more robust reward functions via tighter integration of supervised and preference-based LLM post-training.

What problem does this paper attempt to address?

The main problem this paper addresses is the limitation of maximum likelihood estimation (MLE) in language model training when the goal is text that is both high-quality and diverse. Specifically, the paper explores inverse reinforcement learning (IRL) methods for fine-tuning language models, aiming to improve the balance between task performance and generation diversity by directly optimizing sequences rather than the likelihoods of individual tokens. The specific objectives of the paper include:

1. **Explore RL-based optimization methods**: For fine-tuning large language models, study IRL from the distribution-matching perspective, in which rewards are extracted and whole sequences are optimized directly, and compare it with the traditional MLE approach.
2. **Establish the connection between MLE and IRL**: By reformulating inverse soft-Q-learning as a temporal-difference-regularized extension of MLE, explicitly connect MLE to algorithms that exploit the sequential structure of language generation, while allowing a trade-off between added complexity and improved performance and generation diversity.
3. **Evaluate the effects of different IRL methods**: Compare multiple IRL methods, including adversarial and non-adversarial as well as offline and online variants, to improve the understanding of imitation learning for large language models (LLMs). Experimental results show that IRL-based methods clearly improve the diversity of model generations while maintaining or enhancing task performance, even on fixed supervised fine-tuning (SFT) datasets without online data generation.
4. **Analyze the extracted reward functions**: Analyze the reward functions that IRL extracts from the demonstration data, and point out that a tighter integration of supervised and preference-based LLM post-training may yield more robust reward functions and better alignment with human intentions.

In summary, by studying IRL methods for language model fine-tuning, this paper aims to provide an effective alternative to traditional MLE, especially where task performance must be balanced against generation diversity.
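To make the second objective more concrete, the following is a minimal sketch of what a temporal-difference-regularized MLE loss could look like for a single demonstrated sequence. It is an illustration under stated assumptions, not the paper's implementation: the function name `td_regularized_mle_loss`, the weight `lam`, the discount `gamma`, the squared-error form of the regularizer, and the terminal value of zero are all hypothetical choices. The sketch assumes that the model's per-step logits are read as soft Q-values, so that the soft value is their log-sum-exp and the policy is their softmax.

```python
import numpy as np


def logsumexp(x, axis=-1):
    """Numerically stable log-sum-exp along one axis."""
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))


def td_regularized_mle_loss(q_values, tokens, lam=0.1, gamma=1.0):
    """Illustrative loss: MLE term plus a squared temporal-difference penalty.

    q_values: (T, V) array, logits at each step interpreted as Q(s_t, .)
    tokens:   (T,) integer array of demonstrated tokens a_t
    lam:      assumed regularization weight (lam = 0 recovers plain MLE)
    gamma:    assumed discount factor
    """
    T = q_values.shape[0]
    v = logsumexp(q_values, axis=-1)           # soft value V(s_t)
    q_taken = q_values[np.arange(T), tokens]   # Q(s_t, a_t)
    log_probs = q_taken - v                    # log softmax = log pi(a_t | s_t)
    mle = -np.mean(log_probs)                  # standard next-token NLL

    # Value of the next state; assumed to be 0 after the final token.
    v_next = np.append(v[1:], 0.0)
    td_error = q_taken - gamma * v_next        # implied per-step reward
    reg = 0.5 * np.mean(td_error ** 2)         # assumed squared penalty

    return mle + lam * reg, mle, reg


# Usage: three steps over a four-token vocabulary with uniform logits.
q = np.zeros((3, 4))
demo = np.array([0, 1, 2])
total, mle, reg = td_regularized_mle_loss(q, demo, lam=0.1)
```

In this sketch, setting `lam` to zero leaves only the token-level likelihood term, while larger values couple neighboring time steps through the value estimates, which is the sense in which the regularizer uses the sequential structure of generation.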

Imitating Language via Scalable Inverse Reinforcement Learning

Offline RL for Natural Language Generation with Implicit Language Q Learning

Teaching Large Language Models to Reason with Reinforcement Learning

Insights from the Inverse: Reconstructing LLM Training Goals Through Inverse RL

Beyond Sparse Rewards: Enhancing Reinforcement Learning with Language Model Critique in Text Generation

Training Language Models with Language Feedback at Scale

RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs

Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint

Online Intrinsic Rewards for Decision Making Agents from Large Language Model Feedback

In-Context Learning with Reinforcement Learning for Incomplete Utterance Rewriting

Shattering the Agent-Environment Interface for Fine-Tuning Inclusive Language Models

Reinforcement Learning Fine-tuning of Language Models is Biased Towards More Extractable Features

Few-shot In-Context Preference Learning Using Large Language Models

Fine-Tuning Language Models with Reward Learning on Policy

Sequence to Sequence Reward Modeling: Improving RLHF by Language Feedback

Learning Goal-Conditioned Representations for Language Reward Models

Secrets of RLHF in Large Language Models Part II: Reward Modeling

Reinforcement Learning for Aligning Large Language Models Agents with Interactive Environments: Quantifying and Mitigating Prompt Overfitting

RLSF: Reinforcement Learning via Symbolic Feedback

Fine-Tuning Language Models from Human Preferences