Sample Efficient Reinforcement Learning with Double Importance Sampling Weight Clipping

Jiale Han, Mingxiao Feng, Wengang Zhou, Houqiang Li
DOI: https://doi.org/10.1109/cog57401.2023.10333147
2023-01-01
Abstract: Proximal Policy Optimization (PPO) is a stable on-policy policy gradient (PG) method thanks to its clipped importance sampling (IS) weight objective for policy improvement. However, on-policy PG methods usually suffer from poor sample efficiency. In contrast, off-policy methods have demonstrated better sample efficiency by making more effective use of all samples collected during training. In this work, we aim to develop methods that inherit both the stability of on-policy PG methods and the data efficiency of off-policy methods. To this end, we present GeDISC, an off-policy algorithm that improves sample efficiency by reusing off-policy samples drawn from prior policies. In addition, we propose double IS weight clipping to control the high instability caused by off-policy data. We take the recently proposed generalized clipping mechanism for off-policy data as the first clipping, which bounds the policy update relative to the current policy, and we extend the standard PPO clipping mechanism as the second clipping, which prevents the high variance and bias introduced by extremely old samples. Extensive experiments on continuous and discrete control tasks show that the proposed algorithm outperforms PPO and other state-of-the-art PPO-based off-policy algorithms.
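The abstract does not state the GeDISC objective itself, so the following is only a minimal sketch of how a double clipping of IS weights might look. The first function is the standard PPO clipped surrogate. The second is one possible reading of "double clipping": a generalized clip centred on the current-to-behaviour policy ratio, followed by an absolute clip around 1 that limits the influence of very old samples. All names (`double_clip_objective`, `eps`, `eps_old`) and the default values are assumptions made for illustration, not taken from the paper.

```python
import numpy as np


def ppo_clip_objective(ratio, adv, eps=0.2):
    """Standard PPO clipped surrogate (to be maximised)."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratio * adv, clipped * adv).mean()


def double_clip_objective(logp_new, logp_cur, logp_behav, adv,
                          eps=0.2, eps_old=0.5):
    """Hypothetical double-clipped surrogate (illustrative, not the paper's).

    logp_new   : log-prob of actions under the policy being optimised
    logp_cur   : log-prob under the current policy (start of this update)
    logp_behav : log-prob under the older policy that collected the sample
    """
    ratio = np.exp(logp_new - logp_behav)   # IS weight w.r.t. the behaviour policy
    center = np.exp(logp_cur - logp_behav)  # current policy vs. behaviour policy

    # First clip (assumed form of generalized clipping): keep the IS weight
    # within eps of the current policy's own weight.
    clipped = np.clip(ratio, center - eps, center + eps)

    # Second clip (assumed form): an absolute bound around 1, so samples from
    # very old policies cannot carry extreme weights.
    clipped = np.clip(clipped, 1.0 - eps_old, 1.0 + eps_old)

    # Pessimistic bound, as in PPO.
    return np.minimum(ratio * adv, clipped * adv).mean()
```

Under these assumptions, when the samples are on-policy (`logp_behav == logp_cur`) and `eps_old >= eps`, the second clip is inactive and the sketch reduces to the standard PPO objective.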