Learning a Diffusion Model Policy from Rewards via Q-Score Matching

Michael Psenka, Alejandro Escontrela, Pieter Abbeel, Yi Ma
2024-07-16
Abstract: Diffusion models have become a popular choice for representing actor policies in behavior cloning and offline reinforcement learning. This is due to their natural ability to optimize an expressive class of distributions over a continuous space. However, previous works fail to exploit the score-based structure of diffusion models, and instead utilize a simple behavior cloning term to train the actor, limiting their ability in the actor-critic setting. In this paper, we present a theoretical framework linking the structure of diffusion model policies to a learned Q-function, by linking the structure between the score of the policy to the action gradient of the Q-function. We focus on off-policy reinforcement learning and propose a new policy update method from this theory, which we denote Q-score matching. Notably, this algorithm only needs to differentiate through the denoising model rather than the entire diffusion model evaluation, and converged policies through Q-score matching are implicitly multi-modal and explorative in continuous domains. We conduct experiments in simulated environments to demonstrate the viability of our proposed method and compare to popular baselines. Source code is available from the project website: https://michaelpsenka.io/qsm.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The problem this paper addresses is how to represent actor policies in reinforcement learning (RL) with diffusion models and how to optimize those policies with Q-score matching, with the aim of improving performance and sample efficiency.

### Specific problem description

1. **Limitations of existing methods**:
   - Diffusion models perform well in behavior cloning and offline reinforcement learning because they can optimize an expressive class of distributions over continuous spaces. However, most previous works do not exploit the score-based structure of diffusion models. Instead, they train the actor with a simple behavior cloning term, which limits performance in actor-critic settings.
2. **Requirement for new optimization methods**:
   - To overcome these limitations, the authors propose a theoretical framework that links the structure of diffusion model policies to the learned Q-function. The policy is updated by matching the policy score, $\nabla_a \log \pi(a \mid s)$, to the action gradient of the Q-function, $\nabla_a Q(s, a)$. This method is called Q-score matching (QSM).

### Core idea of the solution

- **Q-score matching**: Optimize the policy by iteratively matching the policy score to the action gradient of the Q-function. Specifically, the authors prove that, in both deterministic and stochastic environments, aligning the policy score with the action gradient of the Q-function can strictly increase the Q value.
- **Multimodality and exploration**: Policies that converge under Q-score matching are implicitly multimodal and explorative in continuous domains, meaning that the policy can place probability on several optimal action modes rather than collapsing to a single one.

### Experimental verification

- **Experimental environment**: The authors ran experiments in multiple simulated environments, including tasks such as Cartpole Balance and Cheetah Run.
- **Performance comparison**: The experimental results show that QSM is comparable or superior to the existing SAC and TD3 algorithms in sample efficiency, especially on tasks where high reward can be reached with fewer samples.
- **Multimodal policy**: QSM can learn multimodal policies. For example, in a simple toy cartpole swingup task, QSM learned two different initial action modes (-1 and 1), each corresponding to one of two optimal trajectories.

### Summary

The main contribution of this paper is a new method, Q-score matching, for optimizing diffusion model policies in reinforcement learning. Through theoretical analysis and experimental verification, the authors show that QSM improves algorithm performance and also enhances the multimodality and exploration of the policy, allowing it to better approximate the true optimal policy.
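To make the idea of matching a policy score to the action gradient of a Q-function concrete, here is a minimal NumPy sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: the state is dropped, the critic is a fixed hand-written function `Q(a) = -(a^2 - 1)^2` with maxima at -1 and 1 (the paper learns the critic jointly), the score model is a two-parameter polynomial instead of a neural network, `ALPHA` is an arbitrary scaling constant, and actions are drawn with plain Langevin dynamics.

```python
import numpy as np

ALPHA = 2.0  # assumed scaling between the policy score and grad_a Q


def grad_q(a):
    """Action gradient of the toy critic Q(a) = -(a**2 - 1)**2."""
    return -4.0 * a * (a**2 - 1.0)


def score(theta, a):
    """Toy score model Psi_theta(a) = theta[0] * a + theta[1] * a**3."""
    return theta[0] * a + theta[1] * a**3


def fit_score(steps=3000, lr=0.1, batch=256, seed=0):
    """Gradient descent on the loss mean((Psi_theta(a) - ALPHA * grad_q(a))**2)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)
    for _ in range(steps):
        a = rng.uniform(-1.5, 1.5, size=batch)       # actions to match on
        residual = score(theta, a) - ALPHA * grad_q(a)
        feats = np.stack([a, a**3], axis=1)          # d Psi / d theta
        grad = 2.0 * (feats.T @ residual) / batch    # gradient of the loss
        theta -= lr * grad
    return theta


def sample_actions(theta, n=2000, steps=500, eps=0.01, seed=1):
    """Draw actions by Langevin dynamics driven by the fitted score."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(n)                       # start from pure noise
    for _ in range(steps):
        noise = rng.standard_normal(n)
        a = a + eps * score(theta, a) + np.sqrt(2.0 * eps) * noise
        a = np.clip(a, -2.0, 2.0)                    # assumed action bounds
    return a


if __name__ == "__main__":
    theta = fit_score()
    actions = sample_actions(theta)
    # The matching target is exactly representable: [4 * ALPHA, -4 * ALPHA].
    print("fitted theta:", theta)
    print("fraction of positive actions:", np.mean(actions > 0))
```

Because the fitted score equals a scaled gradient of Q, the sampled actions concentrate around both maxima of the toy critic instead of a single one, which is the kind of multimodal behavior the summary attributes to QSM. In a real actor-critic loop the critic would be retrained as the policy changes, so this fixed-critic version only illustrates a single matching step.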