Abstract: In offline reinforcement learning (RL), out-of-distribution actions must be managed to prevent overestimation of value functions. Policy-regularized methods address this problem by constraining the target policy to stay close to the behavior policy. Although several approaches represent the behavior policy as an expressive diffusion model to boost performance, it remains unclear how to regularize the target policy given a diffusion-modeled behavior sampler. In this paper, we propose Diffusion Actor-Critic (DAC), which formulates Kullback-Leibler (KL) constrained policy iteration as a diffusion noise regression problem, enabling direct representation of target policies as diffusion models. Our approach follows the actor-critic learning paradigm: we alternately train a diffusion-modeled target policy and a critic network. The actor training loss includes a soft Q-guidance term from the Q-gradient. The soft Q-guidance is grounded in the theoretical solution of the KL-constrained policy iteration, which prevents the learned policy from taking out-of-distribution actions. For critic training, we train a Q-ensemble to stabilize the estimation of the Q-gradient. Additionally, DAC employs a lower confidence bound (LCB) to address the overestimation and underestimation of value targets due to function approximation error. Our approach is evaluated on the D4RL benchmarks and outperforms the state of the art in almost all environments. Code is available at https://github.com/Fang-Lin93/DAC.
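The abstract does not spell out how the lower confidence bound is computed from the Q-ensemble. As a rough illustration only, one common form is the ensemble mean minus a multiple of the ensemble standard deviation. The sketch below assumes that form; the function name `lcb_value` and the coefficient `rho` are hypothetical and are not taken from the paper or its code.

```python
from statistics import mean, pstdev


def lcb_value(q_estimates, rho=1.0):
    """Lower confidence bound over ensemble Q estimates for one (state, action) pair.

    Assumed form (not confirmed by the abstract): mean - rho * std.
    `rho` is a hypothetical coefficient controlling how pessimistic the bound is.
    """
    return mean(q_estimates) - rho * pstdev(q_estimates)


# Hypothetical Q estimates from a three-member ensemble.
ensemble_q = [1.0, 2.0, 3.0]

print(lcb_value(ensemble_q, rho=0.0))  # plain ensemble mean, no pessimism
print(lcb_value(ensemble_q, rho=1.0))  # mean penalized by ensemble disagreement
```

Under this assumed form, a larger `rho` pushes the value target further below the ensemble mean when the members disagree, while `rho = 0` recovers the plain mean; when all members agree, the bound equals their shared value regardless of `rho`.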

Actor-Critic Alignment for Offline-to-Online Reinforcement Learning

DROP: Conservative Model-based Optimization for Offline Reinforcement Learning

Behavior Proximal Policy Optimization

Design from Policies: Conservative Test-Time Adaptation for Offline Policy Optimization

Offline-Boosted Actor-Critic: Adaptively Blending Optimal Historical Behaviors in Deep Off-Policy RL

In-sample Actor Critic for Offline Reinforcement Learning

Efficient Offline Reinforcement Learning: The Critic is Critical

Online Meta-Critic Learning for Off-Policy Actor-Critic Methods

Robust Offline Reinforcement Learning from Low-Quality Data

Importance Weighted Actor-Critic for Optimal Conservative Offline Reinforcement Learning

Online Tuning for Offline Decentralized Multi-Agent Reinforcement Learning

Understanding the performance gap between online and offline alignment algorithms

Identifying drug-induced lung injury in a patient with rheumatoid arthritis

Adaptive Policy Learning for Offline-to-Online Reinforcement Learning

Adaptive Behavior Cloning Regularization for Stable Offline-to-Online Reinforcement Learning

CDSA: Conservative Denoising Score-based Algorithm for Offline Reinforcement Learning

Deploying Offline Reinforcement Learning with Human Feedback

Plan Better Amid Conservatism: Offline Multi-Agent Reinforcement Learning with Actor Rectification

Offline Decentralized Multi-Agent Reinforcement Learning

Efficient and Stable Offline-to-online Reinforcement Learning Via Continual Policy Revitalization

Diffusion Actor-Critic: Formulating Constrained Policy Iteration as Diffusion Noise Regression for Offline Reinforcement Learning