Abstract: Diffusion models have garnered widespread attention in Reinforcement Learning (RL) for their powerful expressiveness and multimodality. Prior work has shown that diffusion policies can significantly improve the performance of RL algorithms in continuous control tasks by overcoming the limitations of unimodal policies, such as Gaussian policies, and by providing the agent with enhanced exploration capabilities. However, existing works mainly focus on applying diffusion policies in offline RL, while their incorporation into online RL remains less investigated. The training objective of the diffusion model, known as the variational lower bound, cannot be optimized directly in online RL because 'good' actions are unavailable. This makes diffusion policy improvement difficult. To overcome this, we propose a novel model-free diffusion-based online RL algorithm, Q-weighted Variational Policy Optimization (QVPO). Specifically, we introduce the Q-weighted variational loss, which can be proven to be a tight lower bound of the policy objective in online RL under certain conditions. To fulfill these conditions in general scenarios, we introduce Q-weight transformation functions. Additionally, to further enhance the exploration capability of the diffusion policy, we design a special entropy regularization term. We also develop an efficient behavior policy that enhances sample efficiency by reducing the variance of the diffusion policy during online interactions. Consequently, the QVPO algorithm leverages the exploration capabilities and multimodality of diffusion policies, preventing the RL agent from converging to a sub-optimal policy. To verify the effectiveness of QVPO, we conduct comprehensive experiments on MuJoCo benchmarks. The final results demonstrate that QVPO achieves state-of-the-art performance in both cumulative reward and sample efficiency.
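The abstract does not state the Q-weighted variational loss in closed form. As a rough illustration of the general idea only, the sketch below weights a standard per-sample noise-prediction (denoising) error by a non-negative weight derived from Q-values. The function names `q_weights` and `q_weighted_denoising_loss`, the clipped-advantage transformation, and the use of a plain mean are all assumptions made for this example; they are not taken from the paper and may differ from the actual QVPO formulation.

```python
import numpy as np


def q_weights(q_values, baseline):
    """Hypothetical weight transformation (an assumption, not the paper's):
    clip Q minus a baseline at zero so that every weight is non-negative."""
    return np.maximum(np.asarray(q_values, dtype=float) - baseline, 0.0)


def q_weighted_denoising_loss(pred_noise, true_noise, weights):
    """Mean over samples of weight * squared noise-prediction error.

    pred_noise, true_noise: arrays of shape (batch, action_dim)
    weights: array of shape (batch,), assumed non-negative
    """
    pred_noise = np.asarray(pred_noise, dtype=float)
    true_noise = np.asarray(true_noise, dtype=float)
    # Per-sample denoising error, averaged over action dimensions.
    per_sample = np.mean((pred_noise - true_noise) ** 2, axis=-1)
    return float(np.mean(weights * per_sample))


# Toy batch of three sampled actions with made-up Q-values.
q = np.array([1.0, 3.0, 2.0])
w = q_weights(q, baseline=2.0)  # only the second action gets a positive weight
pred = np.zeros((3, 2))
true = np.array([[1.0, 1.0], [2.0, 0.0], [0.0, 0.0]])
loss = q_weighted_denoising_loss(pred, true, w)
```

In this toy setup, actions whose Q-value does not exceed the baseline contribute nothing to the loss, so the denoiser is trained only toward actions the critic rates above the baseline. This is one plausible reading of why a transformation to non-negative weights is needed, not a statement of the paper's method.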

Variational Policy Propagation for Multi-agent Reinforcement Learning

VMAPD: Generate Diverse Solutions for Multi-Agent Games with Recurrent Trajectory Discriminators

Intention Propagation for Multi-agent Reinforcement Learning

Variational Automatic Curriculum Learning for Sparse-Reward Cooperative Multi-Agent Problems

Optimal Exploration Algorithm of Multi-Agent Reinforcement Learning Methods (Student Abstract)

Multi-Agent Path Finding Method Based on Evolutionary Reinforcement Learning

Multi-agent Reinforcement Learning with Deep Networks for Diverse Q-Vectors

A Multi-Agent Off-Policy Actor-Critic Algorithm for Distributed Reinforcement Learning

Diffusion-based Reinforcement Learning via Q-weighted Variational Policy Optimization

Robust Multi-Agent Reinforcement Learning via Minimax Deep Deterministic Policy Gradient

Value Propagation for Decentralized Networked Deep Multi-agent Reinforcement Learning

A Variational Approach to Mutual Information-Based Coordination for Multi-Agent Reinforcement Learning

Variational Inequality Methods for Multi-Agent Reinforcement Learning: Performance and Stability Gains

Multi-Agent Reinforcement Learning via Distributed MPC as a Function Approximator

Backpropagation Through Agents

Double Duality: Variational Primal-Dual Policy Optimization for Constrained Reinforcement Learning

Decentralized Natural Policy Gradient with Variance Reduction for Collaborative Multi-Agent Reinforcement Learning

Scalable and Sample Efficient Distributed Policy Gradient Algorithms in Multi-Agent Networked Systems

Multi-View Reinforcement Learning

Variational Inference for Policy Gradient

Biologically Plausible Variational Policy Gradient with Spiking Recurrent Winner-Take-All Networks