Abstract: Diffusion models have achieved remarkable results in image generation, and have similarly been used to learn high-performing policies in sequential decision-making tasks. Decision-making diffusion models can be trained on lower-quality data, and then be steered with a reward function to generate near-optimal trajectories. We consider the problem of extracting a reward function by comparing a decision-making diffusion model that models low-reward behavior and one that models high-reward behavior; a setting related to inverse reinforcement learning. We first define the notion of a relative reward function of two diffusion models and show conditions under which it exists and is unique. We then devise a practical learning algorithm for extracting it by aligning the gradients of a reward function -- parametrized by a neural network -- to the difference in outputs of both diffusion models. Our method finds correct reward functions in navigation environments, and we demonstrate that steering the base model with the learned reward functions results in significantly increased performance in standard locomotion benchmarks. Finally, we demonstrate that our approach generalizes beyond sequential decision-making by learning a reward-like function from two large-scale image generation diffusion models. The extracted reward function successfully assigns lower rewards to harmful images.

What problem does this paper attempt to address?

The paper addresses the problem of extracting reward functions from decision-making diffusion models. Specifically, the authors propose a method that extracts a relative reward function by comparing a decision-making diffusion model that models low-reward behavior with one that models high-reward behavior. The method does not require environment access, simulators, or iterative policy optimization, and it applies to both continuous and discrete diffusion models.

The authors first define the relative reward function of two diffusion models and give conditions under which it exists and is unique. They then design a practical learning algorithm that extracts it by aligning the gradient of a reward function, parametrized by a neural network, with the difference between the outputs of the two diffusion models. Experiments show that steering the base model with the learned reward function significantly improves performance on standard locomotion benchmarks, and that the method extends beyond sequential decision-making, for example to learning a reward-like function from two large-scale image generation diffusion models.

In summary, the main contributions of the paper are:

1. Introducing the concept of a relative reward function of two diffusion models and analyzing mathematically how it relates to rewards in sequential decision-making.
2. Proposing a practical learning algorithm that extracts the relative reward function by aligning the gradient of the reward function with the difference in the outputs of the two diffusion models.
3. Validating the method in long-horizon planning environments, high-dimensional control environments, and tasks beyond sequential decision-making.
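The gradient-alignment idea described above can be illustrated with a toy sketch. This is not the authors' implementation: it assumes each "diffusion model" is replaced by the exact score function of a unit-variance Gaussian, and that the reward is linear, r(x) = w · x, rather than a neural network. The names `score_gaussian`, `reward_grad`, and `fit_relative_reward` are hypothetical. With noise-prediction networks instead of score outputs, the sign and scaling of the target would likely differ.

```python
import random


def score_gaussian(x, mean):
    """Score (gradient of the log-density) of a unit-variance Gaussian."""
    return [m - xi for xi, m in zip(x, mean)]


def reward_grad(w, x):
    """Gradient w.r.t. x of the linear reward r(x) = w . x, which is w itself."""
    return list(w)


def fit_relative_reward(base_mean, expert_mean, steps=500, lr=0.1, seed=0):
    """Fit w so that grad_x r(x) matches the difference of the two model outputs.

    Assumption: the base and expert models are stand-in Gaussian scores,
    not trained diffusion networks.
    """
    rng = random.Random(seed)
    dim = len(base_mean)
    w = [0.0] * dim
    for _ in range(steps):
        # Sample a point at which both models are queried.
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        expert_out = score_gaussian(x, expert_mean)
        base_out = score_gaussian(x, base_mean)
        target = [e - b for e, b in zip(expert_out, base_out)]
        grad_r = reward_grad(w, x)
        # Loss is 0.5 * ||grad_r - target||^2; since grad_r = w,
        # its gradient with respect to w is (grad_r - target).
        w = [wi - lr * (gi - ti) for wi, gi, ti in zip(w, grad_r, target)]
    return w


w = fit_relative_reward(base_mean=[0.0, 0.0], expert_mean=[1.0, -2.0])
```

In this toy setting the difference of the two scores is the constant vector `expert_mean - base_mean`, so the fitted `w` converges to it, and the recovered reward increases in the direction from the base behavior toward the expert behavior.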

Extracting Reward Functions from Diffusion Models
