Optimistic Posterior Sampling for Reinforcement Learning: Worst-Case Regret Bounds

Shipra Agrawal, Randy Jia
DOI: https://doi.org/10.1287/moor.2022.1266
IF: 2.215
2022-05-08
Mathematics of Operations Research
Abstract: We present an algorithm based on posterior sampling (also known as Thompson sampling) that achieves near-optimal worst-case regret bounds when the underlying Markov decision process (MDP) is communicating with a finite, although unknown, diameter. Our main result is a high-probability regret upper bound of [Formula: see text] for any communicating MDP with S states, A actions, and diameter D. Here, regret compares the total reward achieved by the algorithm to the total expected reward of an optimal infinite-horizon undiscounted average-reward policy in time horizon T. This result closely matches the known lower bound of [Formula: see text]. Our techniques involve proving some novel results about the anti-concentration of the Dirichlet distribution, which may be of independent interest.
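The regret notion described in the abstract can be written out as a short formula. This is a sketch in standard average-reward notation, not a quotation from the paper: the symbols \rho^* (the optimal long-run average reward, or gain, of the MDP), r_t (the reward collected by the algorithm at step t), and \mathcal{R}(T) are assumptions introduced here for illustration.

```latex
% Assumed notation (not taken from the abstract):
%   \rho^*  : optimal infinite-horizon undiscounted average reward (gain)
%   r_t     : reward obtained by the algorithm at time step t
%   T       : time horizon
\[
  \mathcal{R}(T) \;=\; T\,\rho^{*} \;-\; \sum_{t=1}^{T} r_t
\]
```

Under this reading, T\rho^* stands for the total expected reward of the optimal average-reward policy over T steps, and the sum is the total reward the algorithm actually achieves.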
mathematics, applied; operations research & management science