Abstract: Experience replay is a key component in the success of online reinforcement learning (RL). Prioritized experience replay (PER) reweights experiences by their temporal difference (TD) errors, which empirically improves performance. However, few works have examined why TD error is a good prioritization signal. In this work, we provide an alternative perspective on TD-error-based reweighting by connecting experience prioritization to occupancy optimization. Using a regularized RL objective with an $f$-divergence regularizer and its dual form, we show that an optimal solution to the objective is obtained by shifting the distribution of off-policy data in the replay buffer toward the on-policy optimal distribution via TD-error-based occupancy ratios. This derivation yields a new pipeline for TD error prioritization. Taking the KL divergence as the regularizer, we obtain a new prioritization scheme, regularized optimal experience replay (ROER). We evaluate ROER with the Soft Actor-Critic (SAC) algorithm on continuous control tasks from the MuJoCo and DM Control benchmarks, where it outperforms the baselines in 6 of 11 tasks and matches or stays close to them on the remaining ones. Further, with pretraining, ROER achieves a noticeable improvement on the difficult Antmaze environment, where the baselines fail, showing its applicability to offline-to-online fine-tuning. Code is available at https://github.com/XavierChanglingLi/Regularized-Optimal-Experience-Replay.
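As a rough illustration of the contrast the abstract draws, the sketch below compares the standard proportional PER rule, $p_i \propto (|\delta_i| + \epsilon)^\alpha$, with the exponential weighting $w_i \propto \exp(\delta_i/\beta)$, which is the generic closed form that a KL-regularized objective produces. This is a minimal sketch under stated assumptions, not ROER's implementation: the temperature `beta`, the exponent clipping range, and all function names are hypothetical, and how the TD error $\delta_i$ is computed (for example, which value network supplies it) is left to the caller.

```python
import math
import random
from typing import List, Sequence, Tuple


def per_priorities(td_errors: Sequence[float], alpha: float = 0.6,
                   eps: float = 1e-6) -> List[float]:
    """Proportional PER: P(i) = p_i / sum_k p_k with p_i = (|delta_i| + eps) ** alpha."""
    raw = [(abs(d) + eps) ** alpha for d in td_errors]
    total = sum(raw)
    return [p / total for p in raw]


def kl_tilted_priorities(td_errors: Sequence[float], beta: float = 1.0,
                         clip: Tuple[float, float] = (-5.0, 5.0)) -> List[float]:
    """Exponential tilting w_i ∝ exp(delta_i / beta), normalized to sum to 1.

    Assumption: this is the generic closed form of a KL-regularized objective,
    not ROER's exact rule. Clipping the exponent is a hypothetical safeguard
    against overflow and overly peaked sampling.
    """
    lo, hi = clip
    logits = [min(max(d / beta, lo), hi) for d in td_errors]
    m = max(logits)  # subtract the max before exp for numerical stability
    raw = [math.exp(z - m) for z in logits]
    total = sum(raw)
    return [w / total for w in raw]


def sample_indices(probs: Sequence[float], batch_size: int,
                   rng: random.Random) -> List[int]:
    """Draw a minibatch of buffer indices, with replacement, according to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=batch_size)


if __name__ == "__main__":
    td = [-2.0, -0.5, 0.0, 0.5, 2.0]  # toy TD errors, not real data
    print("PER:", [round(p, 3) for p in per_priorities(td)])
    print("KL :", [round(w, 3) for w in kl_tilted_priorities(td, beta=1.0)])
    print("batch:", sample_indices(kl_tilted_priorities(td), 4, random.Random(0)))
```

Because the exponential form keeps the sign of $\delta_i$, transitions with negative TD error are down-weighted, whereas the $|\delta_i|$ rule treats $-\delta$ and $+\delta$ identically; a smaller `beta` makes the resulting sampling distribution more peaked.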

An Approach to Optimize Replay Buffer in Value-Based Reinforcement Learning

Revisiting Prioritized Experience Replay: A Value Perspective

A Reinforcement Learning Sampling Optimization Method Based on Training Value

Leveraging Efficiency Through Hybrid Prioritized Experience Replay in Door Environment

Efficient Diversity-based Experience Replay for Deep Reinforcement Learning

Understanding the effect of varying amounts of replay per step

High-Value Prioritized Experience Replay For Off-Policy Reinforcement Learning

Prioritized Generative Replay

Prioritized Experience Replay

Double Replay Buffers with Restricted Gradient

Regret Minimization Experience Replay in Off-Policy Reinforcement Learning

Topological Experience Replay

Prioritized Experience Replay for Multi-agent Cooperation

ROER: Regularized Optimal Experience Replay

Z-Score Experience Replay in Off-Policy Deep Reinforcement Learning

Experience Selection In Multi-Agent Deep Reinforcement Learning

Balanced Prioritized Experience Replay in Off-Policy Reinforcement Learning

ACDER: Augmented Curiosity-Driven Experience Replay

Efficient Multi-Goal Reinforcement Learning Via Value Consistency Prioritization

Replay across Experiments: A Natural Extension of Off-Policy RL

A framework of dual replay buffer: Balancing forgetting and generalization in reinforcement learning