ROER: Regularized Optimal Experience Replay

Changling Li, Zhang-Wei Hong, Pulkit Agrawal, Divyansh Garg, Joni Pajarinen
2024-07-04
Abstract: Experience replay serves as a key component in the success of online reinforcement learning (RL). Prioritized experience replay (PER) reweights experiences by the temporal difference (TD) error, empirically enhancing performance. However, few works have explored the motivation for using TD error. In this work, we provide an alternative perspective on TD-error-based reweighting. We show the connections between experience prioritization and occupancy optimization. By using a regularized RL objective with an $f$-divergence regularizer and employing its dual form, we show that an optimal solution to the objective is obtained by shifting the distribution of off-policy data in the replay buffer towards the on-policy optimal distribution using TD-error-based occupancy ratios. Our derivation results in a new pipeline of TD error prioritization. We specifically explore the KL divergence as the regularizer and obtain a new form of prioritization scheme, the regularized optimal experience replay (ROER). We evaluate the proposed prioritization scheme with the Soft Actor-Critic (SAC) algorithm on continuous control MuJoCo and DM Control benchmark tasks, where our proposed scheme outperforms baselines in 6 out of 11 tasks while the results on the rest match or do not deviate far from the baselines. Further, using pretraining, ROER achieves noticeable improvement on the difficult Antmaze environment where baselines fail, showing applicability to offline-to-online fine-tuning. Code is available at https://github.com/XavierChanglingLi/Regularized-Optimal-Experience-Replay.
Machine Learning, Artificial Intelligence, Robotics
What problem does this paper attempt to address?
The paper addresses how experiences should be prioritized in the experience replay mechanism of reinforcement learning (RL). Specifically:

1. **Problems with existing methods**: Existing prioritization methods, such as Prioritized Experience Replay (PER) based on the temporal difference (TD) error, perform well in practice but lack a solid theoretical foundation. Additionally, when the data distribution in the replay buffer differs significantly from the data distribution of the current policy, TD error estimates may be inaccurate, which degrades PER performance.
2. **New perspective proposed**: The authors give a new perspective on TD error prioritization by linking experience prioritization with occupancy optimization. Using the dual form of an RL objective with an f-divergence regularizer, they show that an optimal solution is obtained by shifting the off-policy data distribution in the replay buffer toward the on-policy optimal distribution using TD-error-based occupancy ratios.
3. **New method ROER**: Based on this analysis, the authors propose Regularized Optimal Experience Replay (ROER). The method uses the KL divergence as the regularizer and is evaluated with the Soft Actor-Critic (SAC) algorithm on continuous control tasks. ROER outperforms the baselines in 6 out of 11 MuJoCo and DM Control tasks, and with pretraining it shows a noticeable improvement on the difficult Antmaze environment where the baselines fail.
4. **Value estimation analysis**: Further analysis indicates that ROER estimates the value function more accurately and converges to the true value more quickly, which helps improve overall performance.

In summary, the paper aims to address the shortcomings of existing experience replay mechanisms through theoretical analysis and the design of a new prioritization method, thereby improving the performance of RL algorithms.
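To make the idea of TD-error-based occupancy ratios more concrete, here is a minimal Python sketch. It is not taken from the authors' code. It assumes that a KL regularizer leads to weights proportional to the exponential of the signed TD error divided by a temperature, which is the standard form of the KL convex conjugate. The function names, the temperature `beta`, and the clipping bounds `clip_min` and `clip_max` are all hypothetical choices for illustration; the linked repository should be consulted for the actual update rule.

```python
import math
import random


def kl_priority_weights(td_errors, beta=1.0, clip_min=0.1, clip_max=10.0):
    """Illustrative weights w_i proportional to exp(td_i / beta).

    Assumption: `beta`, `clip_min`, and `clip_max` are hypothetical
    hyperparameters, not values taken from the paper.
    """
    # Subtract the max before exponentiating for numerical stability;
    # this only rescales every weight by the same constant factor.
    shift = max(td_errors)
    raw = [math.exp((td - shift) / beta) for td in td_errors]
    # Normalize to mean 1 so the weights behave like ratios centered at 1.
    mean = sum(raw) / len(raw)
    # Clip so that no single transition dominates sampling.
    return [min(max(r / mean, clip_min), clip_max) for r in raw]


def sample_batch(weights, batch_size, seed=0):
    """Sample buffer indices with probability proportional to the weights."""
    rng = random.Random(seed)
    return rng.choices(range(len(weights)), weights=weights, k=batch_size)


if __name__ == "__main__":
    td_errors = [0.0, 1.0, 2.0]  # toy TD errors for three stored transitions
    weights = kl_priority_weights(td_errors, beta=1.0)
    print(weights)
    print(sample_batch(weights, batch_size=5))
```

In this sketch, transitions with larger TD error receive larger weights and are therefore replayed more often, while the clipping keeps the sampling distribution from collapsing onto a few transitions. A smaller `beta` sharpens the prioritization and a larger `beta` moves it toward uniform sampling.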