Policy Optimization with Stochastic Mirror Descent.

Long Yang, Yu Zhang, Gang Zheng, Qian Zheng, Pengfei Li, Jianghang Huang, Gang Pan
DOI: https://doi.org/10.1609/aaai.v36i8.20863
2022-01-01
Proceedings of the AAAI Conference on Artificial Intelligence
Abstract: Improving sample efficiency has been a longstanding goal in reinforcement learning. This paper proposes the VRMPO algorithm: a sample-efficient policy gradient method with stochastic mirror descent. In VRMPO, a novel variance-reduced policy gradient estimator is presented to improve sample efficiency. We prove that the proposed VRMPO needs only O(ε^{-3}) sample trajectories to achieve an ε-approximate first-order stationary point, which matches the best sample complexity for policy optimization. Extensive empirical results demonstrate that VRMPO outperforms the state-of-the-art policy gradient methods in various settings.
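For readers unfamiliar with the term, the following is a minimal sketch of generic stochastic mirror descent, not the VRMPO algorithm itself. It assumes a toy linear objective over the probability simplex, a negative-entropy mirror map (which gives the exponentiated-gradient update), and Gaussian gradient noise. The variance-reduced estimator described in the abstract is not reproduced here, and the names `mirror_descent_step` and `run_smd` are hypothetical.

```python
import math
import random


def mirror_descent_step(x, grad, step_size):
    """One entropic mirror descent step on the probability simplex.

    With the negative-entropy mirror map, the update is multiplicative:
    x_new[i] is proportional to x[i] * exp(-step_size * grad[i]).
    """
    # Work in log space; the clamp guards against log(0) after underflow.
    logits = [math.log(max(xi, 1e-300)) - step_size * gi
              for xi, gi in zip(x, grad)]
    top = max(logits)  # shift for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


def run_smd(cost, num_steps=500, step_size=0.1, noise=0.1, seed=0):
    """Minimize <cost, x> over the simplex using noisy gradients (toy setup)."""
    rng = random.Random(seed)
    n = len(cost)
    x = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(num_steps):
        # Stochastic gradient: the true gradient (cost) plus Gaussian noise.
        noisy_grad = [c + noise * rng.gauss(0.0, 1.0) for c in cost]
        x = mirror_descent_step(x, noisy_grad, step_size)
    return x


x_final = run_smd([0.9, 0.2, 0.5])
```

In this toy setting the iterate stays on the simplex by construction and its mass concentrates on the lowest-cost coordinate. A variance-reduced method would replace `noisy_grad` with a lower-variance estimator; the details of that estimator are specific to the paper and are not shown.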