Abstract: In offline reinforcement learning (RL), out-of-distribution actions must be managed to prevent overestimation of value functions. Policy-regularized methods address this problem by constraining the target policy to stay close to the behavior policy. Although several approaches represent the behavior policy as an expressive diffusion model to boost performance, it remains unclear how to regularize the target policy given a diffusion-modeled behavior sampler. In this paper, we propose Diffusion Actor-Critic (DAC), which formulates Kullback-Leibler (KL) constrained policy iteration as a diffusion noise regression problem, enabling direct representation of target policies as diffusion models. Our approach follows the actor-critic learning paradigm, alternately training a diffusion-modeled target policy and a critic network. The actor training loss includes a soft Q-guidance term derived from the Q-gradient. The soft Q-guidance is grounded in the theoretical solution of the KL-constrained policy iteration, which prevents the learned policy from taking out-of-distribution actions. For critic training, we train a Q-ensemble to stabilize the estimation of the Q-gradient. Additionally, DAC employs a lower confidence bound (LCB) to address the overestimation and underestimation of value targets caused by function approximation error. Our approach is evaluated on the D4RL benchmarks and outperforms the state of the art in almost all environments. Code is available at https://github.com/Fang-Lin93/DAC.
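To make the LCB idea concrete, the following is a minimal sketch of a lower confidence bound computed over an ensemble of Q-estimates for a single state-action pair. It assumes the common form "ensemble mean minus a multiple of the ensemble standard deviation"; the function name `lcb_target` and the coefficient `rho` are hypothetical and are not taken from the abstract or from the DAC code base.

```python
from statistics import fmean, pstdev


def lcb_target(q_estimates, rho=1.0):
    """Lower confidence bound over an ensemble of Q-estimates.

    q_estimates: Q-values from each ensemble member for one (state, action).
    rho: hypothetical pessimism coefficient (an assumption, not from the paper).
    """
    # Mean of the ensemble minus rho times its (population) standard deviation.
    return fmean(q_estimates) - rho * pstdev(q_estimates)


# Two ensemble members that disagree about the value of the same action.
ensemble_q = [1.0, 3.0]
target = lcb_target(ensemble_q, rho=1.0)
```

Under this assumed form, `rho = 0` recovers the plain ensemble mean, while larger values of `rho` penalize actions on which the ensemble members disagree, giving a target that is more pessimistic than the mean without necessarily collapsing to the ensemble minimum.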

Diversity Actor-Critic: Sample-Aware Entropy Regularization for Sample-Efficient Exploration

Off-Policy Actor-Critic in an Ensemble: Achieving Maximum General Entropy and Effective Environment Exploration in Deep Reinforcement Learning

Non-local Policy Optimization via Diversity-regularized Collaborative Exploration

Diffusion Actor-Critic with Entropy Regulator

Promoting Stochasticity for Expressive Policies Via a Simple and Efficient Regularization Method

An Entropy Regularization Free Mechanism for Policy-based Reinforcement Learning

Increasing Entropy to Boost Policy Gradient Performance on Personalization Tasks

Reducing Entropy Overestimation in Soft Actor Critic Using Dual Policy Network

Distributional Soft Actor Critic for Risk Sensitive Learning

Fast Rates for Maximum Entropy Exploration

ACE: Off-Policy Actor-Critic with Causality-Aware Entropy Regularization

Diverse Exploration for Fast and Safe Policy Improvement

Wasserstein Diversity-Enriched Regularizer for Hierarchical Reinforcement Learning

Open-Ended Diverse Solution Discovery with Regulated Behavior Patterns for Cross-Domain Adaptation

Diffusion Actor-Critic: Formulating Constrained Policy Iteration as Diffusion Noise Regression for Offline Reinforcement Learning

Maximum Entropy Diverse Exploration: Disentangling Maximum Entropy Reinforcement Learning

Adaptive Exploration Network Policy for Effective Exploration in Reinforcement Learning

A Maximum Divergence Approach to Optimal Policy in Deep Reinforcement Learning

Entropy annealing for policy mirror descent in continuous time and space

Off-policy Reinforcement Learning with Optimistic Exploration and Distribution Correction

Maximum Entropy On-Policy Actor-Critic via Entropy Advantage Estimation