Imitating Language via Scalable Inverse Reinforcement Learning

Markus Wulfmeier, Michael Bloesch, Nino Vieillard, Arun Ahuja, Jorg Bornschein, Sandy Huang, Artem Sokolov, Matt Barnes, Guillaume Desjardins, Alex Bewley, Sarah Maria Elisabeth Bechtle, Jost Tobias Springenberg, Nikola Momchev, Olivier Bachem, Matthieu Geist, Martin Riedmiller
2024-09-03
Abstract: The majority of language model training builds on imitation learning. It covers pretraining, supervised fine-tuning, and affects the starting conditions for reinforcement learning from human feedback (RLHF). The simplicity and scalability of maximum likelihood estimation (MLE) for next token prediction led to its role as predominant paradigm. However, the broader field of imitation learning can more effectively utilize the sequential structure underlying autoregressive generation. We focus on investigating the inverse reinforcement learning (IRL) perspective to imitation, extracting rewards and directly optimizing sequences instead of individual token likelihoods and evaluate its benefits for fine-tuning large language models. We provide a new angle, reformulating inverse soft-Q-learning as a temporal difference regularized extension of MLE. This creates a principled connection between MLE and IRL and allows trading off added complexity with increased performance and diversity of generations in the supervised fine-tuning (SFT) setting. We find clear advantages for IRL-based imitation, in particular for retaining diversity while maximizing task performance, rendering IRL a strong alternative on fixed SFT datasets even without online data generation. Our analysis of IRL-extracted reward functions further indicates benefits for more robust reward functions via tighter integration of supervised and preference-based LLM post-training.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the limitations of maximum likelihood estimation (MLE), the dominant method for training language models, when the goal is text that is both high-quality and diverse. It applies inverse reinforcement learning (IRL) to language model fine-tuning, aiming to improve the balance between task performance and generation diversity by optimizing whole sequences rather than the likelihood of individual tokens. The specific objectives of the paper are:

1. **Explore RL-based optimization methods**: Study IRL from a distribution-matching perspective for fine-tuning large language models. In contrast to standard MLE, IRL extracts a reward function from the demonstration data and directly optimizes the generated sequences.
2. **Establish the connection between MLE and IRL**: Reformulate inverse soft-Q-learning as a temporal-difference-regularized extension of MLE. This explicitly links MLE to algorithms that exploit the sequential structure of language generation, and allows trading added complexity for better performance and more diverse generations.
3. **Evaluate the effects of different IRL methods**: Compare several IRL methods, adversarial and non-adversarial, offline and online, to improve the understanding of imitation learning for large language models (LLMs). The experiments show clear advantages for IRL-based imitation: it retains generation diversity while maintaining or improving task performance, even on fixed supervised fine-tuning (SFT) datasets without online data generation.
4. **Analyze the extracted reward functions**: Examine the reward functions that IRL extracts from demonstration data. The analysis indicates that tighter integration of supervised and preference-based LLM post-training could yield more robust reward functions, which may in turn improve alignment with human intentions.

In summary, the paper studies IRL for language model fine-tuning and presents it as an effective alternative to standard MLE, particularly when task performance must be balanced against generation diversity.
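
The link between MLE and IRL named in the second objective can be illustrated with a small numerical sketch. It is not the paper's implementation; it assumes that the model's logits are read as Q-values, that the soft value is `V(s) = logsumexp(Q(s, .))` so that `log pi(a|s) = Q(s, a) - V(s)`, that the discount is 1, and that the regularizer is a squared penalty on the implied reward `Q(s_t, a_t) - V(s_{t+1})`. The function name `td_regularized_mle_loss` and the weight `lam` are hypothetical.

```python
import numpy as np


def logsumexp(x, axis=-1):
    # Numerically stable log(sum(exp(x))) along the given axis.
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))


def td_regularized_mle_loss(logits, tokens, lam=0.1):
    """Illustrative loss: next-token NLL plus a squared TD penalty.

    logits: (T, V) array, read here as Q(s_t, .) (an assumption).
    tokens: (T,) integer array of demonstrated next tokens a_t.
    lam:    hypothetical weight of the temporal-difference regularizer.
    Returns (total_loss, nll).
    """
    steps = logits.shape[0]
    v = logsumexp(logits, axis=-1)        # soft value V(s_t)
    q = logits[np.arange(steps), tokens]  # Q(s_t, a_t) for demonstrated tokens
    nll = -np.mean(q - v)                 # MLE term, since log pi = Q - V
    v_next = np.append(v[1:], 0.0)        # V(s_{t+1}); terminal value set to 0
    td = q - v_next                       # implied reward with discount 1
    return nll + lam * np.mean(td ** 2), nll
```

With `lam=0` the loss reduces to the ordinary next-token cross-entropy, which is the sense in which this kind of objective extends MLE; increasing `lam` couples consecutive time steps through the value at the next state.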