Abstract: Diffusion models have become a popular choice for representing actor policies in behavior cloning and offline reinforcement learning. This is due to their natural ability to optimize an expressive class of distributions over a continuous space. However, previous works fail to exploit the score-based structure of diffusion models, and instead utilize a simple behavior cloning term to train the actor, limiting their ability in the actor-critic setting. In this paper, we present a theoretical framework linking the structure of diffusion model policies to a learned Q-function, by linking the structure between the score of the policy to the action gradient of the Q-function. We focus on off-policy reinforcement learning and propose a new policy update method from this theory, which we denote Q-score matching. Notably, this algorithm only needs to differentiate through the denoising model rather than the entire diffusion model evaluation, and converged policies through Q-score matching are implicitly multi-modal and explorative in continuous domains. We conduct experiments in simulated environments to demonstrate the viability of our proposed method and compare to popular baselines. Source code is available from the project website: https://michaelpsenka.io/qsm

What problem does this paper attempt to address?

The problem this paper addresses is how to represent policies in reinforcement learning (RL) with diffusion models and how to optimize those policies through Q-score matching, thereby improving the performance and sample efficiency of the algorithm.

### Specific problem description

1. **Limitations of existing methods**: Diffusion models perform well in behavior cloning and offline reinforcement learning because they can optimize an expressive class of distributions over continuous spaces. However, previous works have not exploited the score-based structure of diffusion models. Instead, they train the policy with a simple behavior cloning term, which limits performance in the actor-critic setting.
2. **Requirement for new optimization methods**: To overcome these limitations, the authors propose a theoretical framework that links the structure of diffusion model policies to a learned Q-function, and they update the policy by matching the policy score ∇_a log π(a|s) with the action gradient ∇_a Q(s, a). This method is called Q-score matching (QSM).

### Core idea of the solution

- **Q-score matching**: Optimize the policy by iteratively matching the policy score with the action gradient of the Q-function. Specifically, the authors prove that, in both deterministic and stochastic environments, aligning the policy score with the action gradient of the Q-function can strictly increase the Q-value.
- **Multimodality and exploration**: Policies that converge under Q-score matching are implicitly multimodal and explorative in continuous domains, meaning the policy can select among multiple optimal action patterns in a given state.

### Experimental verification

- **Experimental environment**: The authors ran experiments in multiple simulated environments, including tasks such as Cartpole Balance and Cheetah Run.
- **Performance comparison**: The experimental results show that QSM is comparable or superior to the existing SAC and TD3 algorithms in sample efficiency, especially on tasks that require fewer samples to achieve high rewards.
- **Multimodal policy**: QSM can learn multimodal policies. For example, in a simple toy cartpole swingup task, QSM learned two different initial actions (-1 and 1), each corresponding to one of two optimal trajectories.

### Summary

The main contribution of this paper is a new method, Q-score matching (QSM), for optimizing diffusion model policies in reinforcement learning. According to the theoretical analysis and experiments, QSM improves algorithm performance and also enhances the multimodality and exploration of the policy, allowing it to better approximate the true optimal policy.
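The matching idea described above can be illustrated with a minimal one-dimensional sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: the state is omitted, the critic is a hypothetical toy function Q(a) = -(a² - 1)² with two equally good actions at -1 and 1, the score model `psi` is a cubic polynomial rather than a neural denoiser, `alpha` is an assumed scale factor on the action gradient, and actions are drawn with plain Langevin dynamics instead of a full diffusion sampler. The score model is fit by gradient descent on the squared difference between `psi(a)` and `alpha * dQ/da`, and sampling with the fitted score yields actions clustered around both optima.

```python
import numpy as np

def q_action_grad(a):
    """dQ/da for a hypothetical toy critic Q(a) = -(a^2 - 1)^2 (maxima at a = -1, +1)."""
    return -4.0 * a * (a**2 - 1.0)

def features(a):
    """Cubic polynomial features of a scalar action, shape (n, 4)."""
    return np.stack([np.ones_like(a), a, a**2, a**3], axis=-1)

def fit_score(alpha=4.0, steps=3000, lr=0.05, batch=256, seed=0):
    """Fit psi(a) = features(a) @ theta by gradient descent on
    mean((psi(a) - alpha * dQ/da)^2), i.e. match the score to the action gradient."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(4)
    for _ in range(steps):
        a = rng.uniform(-1.5, 1.5, size=batch)      # actions at which to match
        phi = features(a)
        residual = phi @ theta - alpha * q_action_grad(a)
        grad = 2.0 * (phi.T @ residual) / batch     # gradient of the mean squared residual
        theta -= lr * grad
    return theta

def langevin_sample(theta, n=2000, steps=500, eps=0.01, seed=1):
    """Draw actions by Langevin dynamics driven by the fitted score model."""
    rng = np.random.default_rng(seed)
    a = rng.uniform(-1.5, 1.5, size=n)
    for _ in range(steps):
        score = features(a) @ theta
        a = a + 0.5 * eps * score + np.sqrt(eps) * rng.standard_normal(n)
    return a

if __name__ == "__main__":
    theta = fit_score()
    samples = langevin_sample(theta)
    print("fitted coefficients:", np.round(theta, 3))
    print("fraction of positive actions:", np.mean(samples > 0))
    print("mean |action|:", np.mean(np.abs(samples)))
```

In this toy setting the fitted score drives samples toward both a = -1 and a = 1 rather than collapsing to one of them, which is the kind of multimodal behavior the summary attributes to QSM. A larger `alpha` concentrates samples more tightly around the optima, while a smaller one spreads them out.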

Learning a Diffusion Model Policy from Rewards via Q-Score Matching

Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning

Score Regularized Policy Optimization Through Diffusion Behavior

Learning Multimodal Behaviors from Scratch with Diffusion Policy Gradient

Diffusion-based Reinforcement Learning via Q-weighted Variational Policy Optimization

Diffusion Actor-Critic: Formulating Constrained Policy Iteration as Diffusion Noise Regression for Offline Reinforcement Learning

Generating Behaviorally Diverse Policies with Latent Diffusion Models

Consistency Models as a Rich and Efficient Policy Class for Reinforcement Learning

Policy Representation via Diffusion Probability Model for Reinforcement Learning

Sampling from Energy-based Policies using Diffusion

Aligning Diffusion Behaviors with Q-functions for Efficient Continuous Control

Model-free Policy Learning with Reward Gradients

DiffCPS: Diffusion Model based Constrained Policy Search for Offline Reinforcement Learning

Diffusion Policies creating a Trust Region for Offline Reinforcement Learning

Goal-Conditioned Imitation Learning using Score-based Diffusion Policies

Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

Diffusion Spectral Representation for Reinforcement Learning

Training Diffusion Models with Reinforcement Learning

Extracting Reward Functions from Diffusion Models

Boosting Continuous Control with Consistency Policy

Crossway Diffusion: Improving Diffusion-based Visuomotor Policy via Self-supervised Learning