Efficient sample reuse in policy gradients with parameter-based exploration

Tingting Zhao, Hirotaka Hachiya, Voot Tangkaratt, Jun Morimoto, Masashi Sugiyama
DOI: https://doi.org/10.1162/NECO_a_00452
Abstract: The policy gradient approach is a flexible and powerful reinforcement learning method, particularly for problems with continuous actions such as robot control. A common challenge is reducing the variance of policy gradient estimates so that policy updates are reliable. In this letter, we combine the following three ideas to obtain a highly effective policy gradient method: (1) policy gradients with parameter-based exploration, a recently proposed policy search method with low-variance gradient estimates; (2) an importance sampling technique, which allows us to reuse previously gathered data in a consistent way; and (3) an optimal baseline, which minimizes the variance of gradient estimates while maintaining their unbiasedness. For the proposed method, we give a theoretical analysis of the variance of gradient estimates and demonstrate its usefulness through extensive experiments.
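To make the combination of the three ideas concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes an independent Gaussian distribution over policy parameters with means `eta` and standard deviations `tau`, samples `thetas` that were drawn under older hyperparameters (`eta_old`, `tau_old`), and one scalar return per sample. The function names and the specific baseline formula (returns averaged with weights given by the squared importance weight times the squared gradient norm) are assumptions chosen for illustration; the abstract does not state them.

```python
import math

def log_density(theta, eta, tau):
    """Log of an independent Gaussian density N(theta | eta, tau^2)."""
    return sum(
        -0.5 * ((t - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2 * math.pi)
        for t, m, s in zip(theta, eta, tau)
    )

def log_density_grad(theta, eta, tau):
    """Gradient of log_density w.r.t. the means, followed by the std devs."""
    g_eta = [(t - m) / s**2 for t, m, s in zip(theta, eta, tau)]
    g_tau = [((t - m) ** 2 - s**2) / s**3 for t, m, s in zip(theta, eta, tau)]
    return g_eta + g_tau

def iw_gradient(thetas, returns, eta, tau, eta_old, tau_old):
    """Importance-weighted gradient estimate with a variance-reducing baseline.

    thetas were sampled under (eta_old, tau_old); the estimate is for (eta, tau).
    """
    n = len(thetas)
    # (2) Importance weights: ratio of current to sampling density.
    weights = [
        math.exp(log_density(th, eta, tau) - log_density(th, eta_old, tau_old))
        for th in thetas
    ]
    # (1) Parameter-based exploration: differentiate the parameter density.
    grads = [log_density_grad(th, eta, tau) for th in thetas]
    # (3) Baseline (assumed form): sum(R * w^2 * |g|^2) / sum(w^2 * |g|^2).
    sq = [w**2 * sum(x * x for x in g) for w, g in zip(weights, grads)]
    baseline = sum(r * s for r, s in zip(returns, sq)) / sum(sq)
    dim = len(grads[0])
    grad = [
        sum(w * (r - baseline) * g[k] for w, r, g in zip(weights, returns, grads)) / n
        for k in range(dim)
    ]
    return grad, baseline, weights

# Usage: reuse three old samples to estimate the gradient at shifted means.
thetas = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
returns = [1.0, 3.0, 0.5]
grad, baseline, weights = iw_gradient(
    thetas, returns, [0.1, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0]
)
```

When the current and sampling hyperparameters coincide, every importance weight equals 1 and the sketch reduces to a plain baseline-subtracted estimate. In this sketch the baseline is estimated from the same samples as the gradient, which is a simplification.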