Extracting Reward Functions from Diffusion Models

Felipe Nuti, Tim Franzmeyer, João F. Henriques
2023-12-09
Abstract: Diffusion models have achieved remarkable results in image generation, and have similarly been used to learn high-performing policies in sequential decision-making tasks. Decision-making diffusion models can be trained on lower-quality data, and then be steered with a reward function to generate near-optimal trajectories. We consider the problem of extracting a reward function by comparing a decision-making diffusion model that models low-reward behavior and one that models high-reward behavior; a setting related to inverse reinforcement learning. We first define the notion of a relative reward function of two diffusion models and show conditions under which it exists and is unique. We then devise a practical learning algorithm for extracting it by aligning the gradients of a reward function -- parametrized by a neural network -- to the difference in outputs of both diffusion models. Our method finds correct reward functions in navigation environments, and we demonstrate that steering the base model with the learned reward functions results in significantly increased performance in standard locomotion benchmarks. Finally, we demonstrate that our approach generalizes beyond sequential decision-making by learning a reward-like function from two large-scale image generation diffusion models. The extracted reward function successfully assigns lower rewards to harmful images.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the problem of extracting reward functions from decision-making diffusion models. Specifically, the authors propose a method that extracts a relative reward function by comparing a decision-making diffusion model of low-reward behavior with one of high-reward behavior. The method does not require environment access, simulators, or iterative policy optimization, and it is applicable to both continuous and discrete diffusion models.
The authors first define the relative reward function of two diffusion models and give conditions under which it exists and is unique. They then design a practical learning algorithm that extracts this relative reward function by aligning the gradient of a reward function, parametrized by a neural network, with the difference between the outputs of the two diffusion models. Experiments show that steering the base model with the learned reward function significantly improves performance, and that the method extends beyond sequential decision-making, for example to learning reward-like functions from image generation diffusion models.
In summary, the main contributions of the paper are:
1. Introducing the concept of a relative reward function of two diffusion models and analyzing mathematically how it relates to rewards in sequential decision-making.
2. Proposing a practical learning algorithm that extracts the relative reward function by aligning the gradient of the reward function with the difference between the outputs of the two diffusion models.
3. Validating the method in long-horizon planning environments, high-dimensional control environments, and tasks beyond sequential decision-making.
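The gradient-alignment idea can be illustrated with a toy sketch. This is not the authors' implementation: it assumes each "diffusion model" is replaced by the analytic score of a unit-variance Gaussian, uses a linear reward r(x) = w · x in place of a neural network, ignores the noise-level (timestep) dependence, and adopts the sign convention that the reward gradient should match the expert score minus the base score. The names `gaussian_score`, `fit_linear_reward`, and `reward` are hypothetical.

```python
import random


def gaussian_score(x, mean):
    """Score (gradient of the log-density) of a unit-variance Gaussian.

    Stand-in for the output of a trained diffusion model (assumption).
    """
    return [m - xi for xi, m in zip(x, mean)]


def fit_linear_reward(base_mean, expert_mean, steps=500, lr=0.1, seed=0):
    """Fit w so that grad_x r(x) = w matches the difference of the two scores."""
    rng = random.Random(seed)
    dim = len(base_mean)
    w = [0.0] * dim  # linear reward r(x) = w . x, hence grad_x r(x) = w
    for _ in range(steps):
        # Sample a point at which to compare the two models.
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        expert = gaussian_score(x, expert_mean)
        base = gaussian_score(x, base_mean)
        # Alignment target: expert output minus base output (assumed sign).
        target = [e - b for e, b in zip(expert, base)]
        # Gradient step on the squared error ||w - target||^2.
        w = [wi - lr * 2.0 * (wi - ti) for wi, ti in zip(w, target)]
    return w


def reward(w, x):
    """Evaluate the fitted linear reward at x."""
    return sum(wi * xi for wi, xi in zip(w, x))


if __name__ == "__main__":
    base_mean, expert_mean = [0.0, 0.0], [1.0, -2.0]
    w = fit_linear_reward(base_mean, expert_mean)
    print("learned reward gradient:", w)
    print("reward at base mean:", reward(w, base_mean))
    print("reward at expert mean:", reward(w, expert_mean))
```

In this toy setting the score difference is the same at every point, so the fitted reward gradient points from the base mean towards the expert mean, and the learned reward is higher at the expert mean than at the base mean. With real diffusion models the target would vary with both the sample and the noise level, which is why the paper uses a neural network for the reward.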