Nearly Optimal Policy Optimization with Stable at Any Time Guarantee

Tianhao Wu, Yunchang Yang, Han Zhong, Liwei Wang, Simon S. Du, Jiantao Jiao
2022-01-01
Abstract: Policy optimization methods are one of the most widely used classes of Reinforcement Learning (RL) algorithms. However, theoretical understanding of these methods remains insufficient. Even in the episodic (time-inhomogeneous) tabular setting, the state-of-the-art theoretical result for policy-based methods, due to Shani et al. (2020), is only $\tilde{O}(\sqrt{S^2AH^4K})$, where $S$ is the number of states, $A$ is the number of actions, $H$ is the horizon, and $K$ is the number of episodes. This leaves a $\sqrt{SH}$ gap compared with the information-theoretic lower bound $\tilde{\Omega}(\sqrt{SAH^3K})$ (Jin et al., 2018). To bridge this gap, we propose a novel algorithm, Reference-based Policy Optimization with Stable at Any Time guarantee (RPO-SAT), which features the property "Stable at Any Time". We prove that our algorithm achieves $\tilde{O}(\sqrt{SAH^3K} + \sqrt{AH^4K})$ regret. When $S > H$, our algorithm is minimax optimal up to logarithmic factors. To the best of our knowledge, RPO-SAT is the first computationally efficient, nearly minimax optimal policy-based algorithm for tabular RL.
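The two quantitative claims in the abstract can be checked with a short calculation. The following is only a sketch that ignores constants and logarithmic factors; it is not taken from the paper's proofs.

```latex
% Gap between the bound of Shani et al. (2020) and the lower bound:
\[
\frac{\sqrt{S^2 A H^4 K}}{\sqrt{S A H^3 K}} = \sqrt{SH}.
\]
% When S > H, the second term of the RPO-SAT bound is dominated by the first:
\[
\sqrt{A H^4 K} = \sqrt{H \cdot A H^3 K} \le \sqrt{S A H^3 K}
\quad\Longrightarrow\quad
\sqrt{S A H^3 K} + \sqrt{A H^4 K} \le 2\sqrt{S A H^3 K}.
\]
```

Under this reading, the regret upper bound matches the lower bound $\tilde{\Omega}(\sqrt{SAH^3K})$ up to a constant factor and logarithmic terms whenever $S > H$.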