Learning Multimodal Behaviors from Scratch with Diffusion Policy Gradient

Zechu Li, Rickmer Krohn, Tao Chen, Anurag Ajay, Pulkit Agrawal, Georgia Chalvatzaki

2024-06-02

Abstract: Deep reinforcement learning (RL) algorithms typically parameterize the policy as a deep network that outputs either a deterministic action or a stochastic one modeled as a Gaussian distribution, hence restricting learning to a single behavioral mode. Meanwhile, diffusion models emerged as a powerful framework for multimodal learning. However, the use of diffusion policies in online RL is hindered by the intractability of policy likelihood approximation, as well as the greedy objective of RL methods that can easily skew the policy to a single mode. This paper presents Deep Diffusion Policy Gradient (DDiffPG), a novel actor-critic algorithm that learns from scratch multimodal policies parameterized as diffusion models while discovering and maintaining versatile behaviors. DDiffPG explores and discovers multiple modes through off-the-shelf unsupervised clustering combined with novelty-based intrinsic motivation. DDiffPG forms a multimodal training batch and utilizes mode-specific Q-learning to mitigate the inherent greediness of the RL objective, ensuring the improvement of the diffusion policy across all modes. Our approach further allows the policy to be conditioned on mode-specific embeddings to explicitly control the learned modes. Empirical studies validate DDiffPG's capability to master multimodal behaviors in complex, high-dimensional continuous control tasks with sparse rewards, also showcasing proof-of-concept dynamic online replanning when navigating mazes with unseen obstacles.
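The single-mode limitation of a Gaussian policy mentioned in the abstract can be illustrated with a small sketch. This is not from the paper: the synthetic 1-D action data, the mode locations at -1 and +1, and the variable names are all assumptions chosen for illustration. A maximum-likelihood fit of one Gaussian to two equally good behaviors places its mean between them, where no demonstrated action lies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D action data: two equally good behaviors near -1 and +1.
actions = np.concatenate([
    rng.normal(-1.0, 0.05, size=500),
    rng.normal(+1.0, 0.05, size=500),
])

# Maximum-likelihood fit of a single Gaussian: sample mean and std.
mu, sigma = actions.mean(), actions.std()

# Fraction of the observed actions that lie within 0.5 of the fitted mean.
near_mean = np.mean(np.abs(actions - mu) < 0.5)

print(f"fitted mean={mu:.3f}, std={sigma:.3f}, data near mean={near_mean:.2f}")
```

The fitted mean sits near zero and the standard deviation is stretched to cover both clusters, so the most likely action under the fit is one that neither behavior uses. A policy class that can represent several modes, such as a diffusion model, avoids this averaging.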

Machine Learning

What problem does this paper attempt to address?

The paper addresses the fact that most deep reinforcement learning (RL) algorithms can learn only a single behavioral mode. In traditional RL approaches, the policy is parameterized as a deep network that outputs either a deterministic action or a Gaussian distribution, which prevents it from representing multimodal behaviors. The authors propose a novel actor-critic algorithm, Deep Diffusion Policy Gradient (DDiffPG), which learns multimodal policies from scratch. These policies are parameterized as diffusion models, allowing diverse behaviors to be discovered and maintained. DDiffPG explores and discovers multiple modes through novelty-based intrinsic motivation combined with unsupervised hierarchical clustering. It forms multimodal training batches and uses mode-specific Q-learning to alleviate the greediness of the RL objective, ensuring that the diffusion policy improves across all modes. The method also allows the policy to be conditioned on mode-specific embeddings, giving explicit control over which learned mode is executed. Experimental results show that DDiffPG can master multimodal behaviors in complex, high-dimensional continuous control tasks with sparse rewards, and they include a proof-of-concept demonstration of dynamic online replanning when navigating mazes with unseen obstacles. In summary, the paper addresses how to learn multimodal policies in online RL, which helps an agent adapt to non-static environments, avoid local optima, and continue learning when new skills or solutions are introduced.
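The idea of a multimodal training batch can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `sample_multimodal_batch`, the equal share per mode, and the toy replay buffer are assumptions. It presumes each stored transition already carries a mode label produced by the clustering step.

```python
import numpy as np

def sample_multimodal_batch(mode_labels, batch_size, rng):
    """Return buffer indices with an equal share for each discovered mode.

    Hypothetical helper: the equal split is an assumption of this sketch.
    """
    modes = np.unique(mode_labels)
    per_mode = batch_size // len(modes)
    idx = []
    for m in modes:
        candidates = np.flatnonzero(mode_labels == m)
        # Sample with replacement so rare modes can still fill their share.
        idx.append(rng.choice(candidates, size=per_mode, replace=True))
    return np.concatenate(idx)

rng = np.random.default_rng(0)

# Hypothetical buffer: mode 0 dominates (900 transitions), mode 1 is rare (100).
mode_labels = np.array([0] * 900 + [1] * 100)

batch = sample_multimodal_batch(mode_labels, batch_size=64, rng=rng)
counts = np.bincount(mode_labels[batch], minlength=2)
print(counts)  # number of sampled transitions per mode
```

Uniform sampling from this buffer would draw about 90% of each batch from mode 0, so policy updates would be dominated by that mode. Balancing the batch keeps the rare mode represented; in a mode-specific Q-learning setup, each sub-batch would then presumably be evaluated with the Q-function of its own mode.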

Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning

Policy Representation via Diffusion Probability Model for Reinforcement Learning

Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

Dueling Network Architecture for Multi-Agent Deep Deterministic Policy Gradient

Generating Behaviorally Diverse Policies with Latent Diffusion Models

Diffusion Policies for Out-of-Distribution Generalization in Offline Reinforcement Learning

Diffusion Policies creating a Trust Region for Offline Reinforcement Learning

Score Regularized Policy Optimization Through Diffusion Behavior

Diffusion Policy Policy Optimization

Diffusion Actor-Critic with Entropy Regulator

Multi-Agent Deep Deterministic Policy Gradient Algorithm Based on Classification Experience Replay

Learning a Diffusion Model Policy from Rewards via Q-Score Matching

DiffPoGAN: Diffusion Policies with Generative Adversarial Networks for Offline Reinforcement Learning

Efficient Diffusion Policies for Offline Reinforcement Learning

3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations

Crossway Diffusion: Improving Diffusion-based Visuomotor Policy via Self-supervised Learning

Enhancing Exploration with Diffusion Policies in Hybrid Off-Policy RL: Application to Non-Prehensile Manipulation

Diffusion-based Reinforcement Learning via Q-weighted Variational Policy Optimization