On Thompson Sampling and Asymptotic Optimality

J. Leike, Tor Lattimore, Laurent Orseau, Marcus Hutter
DOI: https://doi.org/10.24963/ijcai.2017/688
2017-08-01
Abstract: We discuss some recent results on Thompson sampling for nonparametric reinforcement learning in countable classes of general stochastic environments. These environments can be non-Markovian, non-ergodic, and partially observable. We show that Thompson sampling learns the environment class in the sense that (1) asymptotically its value converges in mean to the optimal value and (2) given a recoverability assumption, regret is sublinear. We conclude with a discussion about optimality in reinforcement learning.
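To make the idea of Thompson sampling over a class of environments concrete, here is a minimal Python sketch. It is an illustration under strong simplifying assumptions, not the construction analyzed in the paper: the class is finite rather than countable, and every environment is a two-armed Bernoulli bandit rather than a general non-Markovian, partially observable process. The function name `thompson_sampling`, the parameter `resample_every` (a hypothetical stand-in for how long a sampled environment is followed before resampling), and the example class are all assumptions made for this sketch.

```python
import random


def thompson_sampling(envs, prior, true_env, steps, resample_every=1, seed=0):
    """Thompson sampling over a finite class of Bernoulli bandits (toy sketch).

    envs: list of tuples; envs[i][a] is the success probability of arm a
          in candidate environment i (a hypothetical environment class).
    prior: prior weights over envs; must give the true environment
           positive weight.
    true_env: tuple of arm success probabilities that generates the rewards.
    """
    rng = random.Random(seed)
    posterior = list(prior)
    total_reward = 0
    sampled = 0
    for t in range(steps):
        # Periodically draw one candidate environment from the posterior.
        if t % resample_every == 0:
            sampled = rng.choices(range(len(envs)), weights=posterior)[0]
        # Act optimally for the sampled environment (here: its best arm).
        arm = max(range(len(envs[sampled])), key=lambda a: envs[sampled][a])
        # Observe a reward from the true environment.
        reward = 1 if rng.random() < true_env[arm] else 0
        total_reward += reward
        # Bayesian update: reweight each candidate by the likelihood
        # it assigns to the observed reward, then renormalize.
        likelihoods = [p[arm] if reward else 1.0 - p[arm] for p in envs]
        posterior = [w * l for w, l in zip(posterior, likelihoods)]
        norm = sum(posterior)
        posterior = [w / norm for w in posterior]
    return posterior, total_reward


if __name__ == "__main__":
    # Two candidate environments that disagree on which arm is better.
    envs = [(0.9, 0.1), (0.1, 0.9)]
    posterior, total = thompson_sampling(envs, [0.5, 0.5], envs[1], steps=500)
    print(posterior, total)
```

In this toy setting the posterior concentrates on the true environment and the agent ends up pulling the better arm almost all the time, which is the bandit analogue of the value converging to the optimal value. The recoverability assumption mentioned in the abstract does not appear here because a bandit has no state from which the agent could fail to recover.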