Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample Complexity

Zihan Zhang, Yuan Zhou, Xiangyang Ji
2020-01-01
Abstract: In this paper we consider the problem of learning an $\epsilon$-optimal policy for a discounted Markov Decision Process (MDP). Given an MDP with $S$ states, $A$ actions, a discount factor $\gamma \in (0, 1)$, and an approximation threshold $\epsilon > 0$, we provide a model-free algorithm that learns an $\epsilon$-optimal policy with sample complexity $\tilde{O}\left(\frac{SA \ln(1/p)}{\epsilon^{2}(1-\gamma)^{5.5}}\right)$ and success probability $1 - p$. For small enough $\epsilon$, we show an improved algorithm with sample complexity $\tilde{O}\left(\frac{SA \ln(1/p)}{\epsilon^{2}(1-\gamma)^{3}}\right)$. While the first bound improves upon all known model-free algorithms and model-based ones with tight dependence on $S$, our second algorithm beats all known sample complexity bounds and matches the information-theoretic lower bound up to logarithmic factors.
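To give a feel for how the two bounds in the abstract differ, the following is a minimal sketch that evaluates only their leading-order term, $SA \ln(1/p) / (\epsilon^{2}(1-\gamma)^{k})$ with $k = 5.5$ or $k = 3$. The function name `leading_term` and the example parameter values are assumptions chosen for illustration. The constants and polylogarithmic factors hidden by $\tilde{O}$ are dropped, so the numbers are not actual sample counts.

```python
import math


def leading_term(S, A, eps, gamma, p, k):
    """Leading-order term S*A*ln(1/p) / (eps^2 * (1-gamma)^k).

    Hypothetical helper: constants and polylog factors hidden by the
    O-tilde notation are NOT included.
    """
    if not (0.0 < gamma < 1.0):
        raise ValueError("gamma must lie in (0, 1)")
    if not (0.0 < p < 1.0):
        raise ValueError("p must lie in (0, 1)")
    if eps <= 0.0:
        raise ValueError("eps must be positive")
    return S * A * math.log(1.0 / p) / (eps ** 2 * (1.0 - gamma) ** k)


# Illustrative parameter values (assumed, not taken from the paper).
params = dict(S=10, A=5, eps=0.1, gamma=0.9, p=0.05)

first_bound = leading_term(**params, k=5.5)   # exponent of the first algorithm
second_bound = leading_term(**params, k=3.0)  # exponent of the improved algorithm

# All other factors cancel, so the ratio is (1 - gamma)^(-2.5).
ratio = first_bound / second_bound
print(f"first / second = {ratio:.1f}")
```

Because every factor other than the power of $(1-\gamma)$ cancels, the ratio of the two leading terms is $(1-\gamma)^{-2.5}$, which is roughly 316 for $\gamma = 0.9$. The gap therefore widens as $\gamma$ approaches 1.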