Polyak-Ruppert Averaged Q-Learning is Statistically Efficient

Xiang Li, Wenhao Yang, Zhihua Zhang, Michael I. Jordan
2021-01-01
Abstract: We study synchronous Q-learning with Polyak-Ruppert averaging (a.k.a. averaged Q-learning) in a γ-discounted MDP. We establish asymptotic normality for the averaged iterate Q̄_T. Furthermore, we show that Q̄_T is a regular asymptotically linear (RAL) estimator of the optimal Q-value function Q∗ with the most efficient influence function. This implies that the averaged Q-learning iterate has the smallest asymptotic variance among all RAL estimators. In addition, we present a non-asymptotic analysis of the ℓ∞ error E‖Q̄_T − Q∗‖∞, showing that it matches the instance-dependent lower bound as well as the optimal minimax complexity lower bound. As a byproduct, we find that the Bellman noise has sub-Gaussian coordinates with variance O((1 − γ)^{-1}) instead of the prevailing O((1 − γ)^{-2}) under the standard bounded reward assumption. The sub-Gaussian result has the potential to improve the sample complexity of many RL algorithms. In short, our theoretical analysis shows that averaged Q-learning is statistically efficient.
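For readers unfamiliar with the estimator, the following is a minimal sketch of synchronous Q-learning with Polyak-Ruppert averaging on a tabular MDP. It is an illustration under stated assumptions, not the authors' implementation: the function name `averaged_q_learning`, the tabular inputs `P` and `R`, the deterministic rewards, and the polynomial step size η_t = t^(-0.75) are all choices made here for the example. At each step every state-action pair draws one next state from a generative model, the Q-table is updated with the empirical Bellman target, and the running average Q̄_T of the iterates is returned.

```python
import random


def averaged_q_learning(P, R, gamma, T, seed=0):
    """Sketch of synchronous Q-learning with Polyak-Ruppert averaging.

    P[s][a] is a list of next-state probabilities and R[s][a] is a
    deterministic reward (both hypothetical inputs for this example).
    Returns the averaged table Q_bar = (1/T) * sum_{t=1..T} Q_t.
    """
    rng = random.Random(seed)
    n_states, n_actions = len(P), len(P[0])
    states = list(range(n_states))
    Q = [[0.0] * n_actions for _ in range(n_states)]
    Q_bar = [[0.0] * n_actions for _ in range(n_states)]

    for t in range(1, T + 1):
        # Assumed polynomial step size; the abstract does not specify one.
        eta = t ** -0.75
        # Greedy values of the previous iterate, shared by all pairs
        # so that the update is synchronous.
        V = [max(row) for row in Q]
        for s in range(n_states):
            for a in range(n_actions):
                # One fresh sample per state-action pair (generative model).
                s_next = rng.choices(states, weights=P[s][a])[0]
                target = R[s][a] + gamma * V[s_next]
                Q[s][a] = (1.0 - eta) * Q[s][a] + eta * target
                # Running mean of the iterates Q_1, ..., Q_t.
                Q_bar[s][a] += (Q[s][a] - Q_bar[s][a]) / t
    return Q_bar


if __name__ == "__main__":
    # One state, one action, reward 1, gamma = 0.5, so Q* = 1 / (1 - 0.5) = 2.
    estimate = averaged_q_learning([[[1.0]]], [[1.0]], gamma=0.5, T=2000)
    print(estimate[0][0])  # should be close to 2
```

The average is maintained incrementally so that no history of iterates has to be stored. The single-state example is deterministic, which makes it a convenient sanity check that the averaged iterate approaches Q∗.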