Near-Optimal Offline Reinforcement Learning via Double Variance Reduction

Ming Yin, Yu Bai, Yu-Xiang Wang
DOI: https://doi.org/10.48550/arXiv.2102.01748
2021-02-03
Abstract: We consider the problem of offline reinforcement learning (RL) -- a well-motivated setting of RL that aims at policy optimization using only historical data. Despite its wide applicability, theoretical understandings of offline RL, such as its optimal sample complexity, remain largely open even in basic settings such as *tabular* Markov Decision Processes (MDPs). In this paper, we propose Off-Policy Double Variance Reduction (OPDVR), a new variance reduction based algorithm for offline RL. Our main result shows that OPDVR provably identifies an $\epsilon$-optimal policy with $\widetilde{O}(H^2/d_m\epsilon^2)$ episodes of offline data in the finite-horizon stationary transition setting, where $H$ is the horizon length and $d_m$ is the minimal marginal state-action distribution induced by the behavior policy. This improves over the best known upper bound by a factor of $H$. Moreover, we establish an information-theoretic lower bound of $\Omega(H^2/d_m\epsilon^2)$ which certifies that OPDVR is optimal up to logarithmic factors. Lastly, we show that OPDVR also achieves rate-optimal sample complexity under alternative settings such as the finite-horizon MDPs with non-stationary transitions and the infinite horizon MDPs with discounted rewards.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
### The problems the paper attempts to solve

The paper "Near-Optimal Offline Reinforcement Learning via Double Variance Reduction" addresses a central question in offline reinforcement learning (offline RL): how to find a near-optimal policy using only historical data. Offline RL is useful in many practical applications, such as robotics, education, autonomous driving, and healthcare, but its theory is still incomplete, especially with respect to optimal sample complexity.

### Main contributions

1. **Propose a new algorithm**:
   - The authors propose Off-Policy Double Variance Reduction (OPDVR), a new variance-reduction-based algorithm. It finds a near-optimal policy from a static offline dataset by performing stochastic (mini-batch-style) value iteration on the offline data.
2. **Improvement in sample complexity**:
   - The authors prove that, in the finite-horizon setting with stationary transitions, OPDVR finds an \(\epsilon\)-optimal policy with high probability using \(\tilde{O}(H^2 / d_m\epsilon^2)\) episodes of offline data, where \(H\) is the horizon length and \(d_m\) is the minimal marginal state-action distribution induced by the behavior policy in the given Markov decision process (MDP). This improves on the best known upper bound by a factor of \(H\).
3. **Theoretical lower bound**:
   - The authors establish an information-theoretic sample complexity lower bound of \(\Omega(H^2 / d_m\epsilon^2)\) for finite-horizon offline RL, which shows that OPDVR is optimal up to logarithmic factors.
4. **Performance in other settings**:
   - The authors also analyze OPDVR for finite-horizon MDPs with non-stationary transitions and for infinite-horizon MDPs with discounted rewards, obtaining sample complexities of \(\tilde{O}(H^3 / d_m\epsilon^2)\) and \(\tilde{O}((1 - \gamma)^{-3}/d_m\epsilon^2)\) respectively. Both are optimal up to logarithmic factors.
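To make the rates above concrete, the following is a minimal sketch that evaluates only the leading-order term \(H^p / (d_m\epsilon^2)\) for the three settings. The function names, the example values of \(H\), \(d_m\), \(\epsilon\), and \(\gamma\), and the choice to drop all constants and logarithmic factors are assumptions made for illustration; the numbers indicate scaling, not actual dataset sizes.

```python
import math


def episodes_order(horizon: float, d_m: float, eps: float, power: int = 2) -> float:
    """Leading-order term horizon**power / (d_m * eps**2).

    Constants and logarithmic factors hidden by the O-tilde notation are dropped,
    so the result only shows how the bound scales.
    """
    if horizon <= 0 or eps <= 0 or not (0 < d_m <= 1):
        raise ValueError("need horizon > 0, eps > 0 and 0 < d_m <= 1")
    return horizon ** power / (d_m * eps ** 2)


def effective_horizon(gamma: float) -> float:
    """Effective horizon 1 / (1 - gamma) of a discounted MDP."""
    if not (0 <= gamma < 1):
        raise ValueError("need 0 <= gamma < 1")
    return 1.0 / (1.0 - gamma)


# Illustrative values only (not taken from the paper).
H, d_m, eps, gamma = 10, 0.01, 0.1, 0.9

stationary = episodes_order(H, d_m, eps, power=2)      # H^2 / (d_m eps^2)
non_stationary = episodes_order(H, d_m, eps, power=3)  # H^3 / (d_m eps^2)
discounted = episodes_order(effective_horizon(gamma), d_m, eps, power=3)

print(f"stationary:     {stationary:.3g}")
print(f"non-stationary: {non_stationary:.3g}")
print(f"discounted:     {discounted:.3g}")
print(f"non-stationary / stationary = {non_stationary / stationary:.3g}")
```

With these inputs the stationary term is smaller than the non-stationary one by exactly a factor of \(H\), which is the same gap the summary reports between OPDVR and the previous best upper bound in the stationary setting.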
### Technical details

- **Variance reduction techniques**:
  - The authors extend variance reduction techniques and remove the dependence on the initial optimization gap \(u(0)\) through a two-stage variance reduction scheme. Variance reduction was previously applied in the generative-model setting; here it is adapted and improved for the offline setting.
- **Estimator design**:
  - The authors design two estimators, \(z_t\) and \(g_t\), for two key terms in the Bellman backup. The estimators use independent batches drawn from the offline dataset, avoid over-optimism through lower confidence bound (LCB) updates, and prevent pessimism through the max operation.
- **Initialization dependence**:
  - A standard initialization (such as \(V_t^{(0)}=0\)) leaves an overly large initial optimization gap. The authors handle this with a two-stage doubling process: the first stage learns to a coarse intermediate precision \(\epsilon'=\sqrt{H}\epsilon\), and the second stage reduces the error from \(\epsilon'\) to \(\epsilon\).

### Conclusion

By proposing the OPDVR algorithm, the paper makes important theoretical progress in offline reinforcement learning: it gives upper and lower bounds on sample complexity that match up to logarithmic factors. Beyond their theoretical significance, the results provide new tools and methods for practical applications.
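The variance-reduction idea from the technical details can be illustrated with a small sketch. It assumes the usual decomposition \(\mathbb{E}[V(s')] = \mathbb{E}[V_{\text{ref}}(s')] + \mathbb{E}[V(s') - V_{\text{ref}}(s')]\), where the first term is estimated once from a large batch and the second from a small, independent batch. The function `vr_backup`, its arguments, and the scalar `bonus` standing in for an LCB correction are hypothetical names chosen for this example. This is not the paper's exact construction of \(z_t\) and \(g_t\), whose confidence widths are not given in this summary.

```python
from statistics import fmean


def vr_backup(values, ref_values, ref_batch, new_batch, bonus=0.0):
    """Variance-reduced estimate of E[V(s')] for one state-action pair.

    values     -- current value estimates V, indexed by state
    ref_values -- reference value estimates V_ref, indexed by state
    ref_batch  -- next states sampled for the reference term (large batch)
    new_batch  -- next states sampled for the difference term (small batch)
    bonus      -- hypothetical pessimism term subtracted in the spirit of an LCB
    """
    # Reference term: estimated from the large batch.
    ref_term = fmean([ref_values[s] for s in ref_batch])
    # Difference term: V - V_ref is small when V is close to V_ref,
    # so a small batch is enough to estimate it accurately.
    diff_term = fmean([values[s] - ref_values[s] for s in new_batch])
    return ref_term + diff_term - bonus


# Toy example with three states (all numbers are illustrative).
ref_values = [0.0, 1.0, 2.0]
values = [0.1, 1.1, 2.1]      # current estimate, close to the reference
ref_batch = [0, 1, 2, 2]      # larger batch for the reference term
new_batch = [1]               # smaller independent batch for the difference

estimate = vr_backup(values, ref_values, ref_batch, new_batch)
pessimistic = vr_backup(values, ref_values, ref_batch, new_batch, bonus=0.05)
print(estimate, pessimistic)
```

The design point is that the difference \(V - V_{\text{ref}}\) shrinks as the iterates improve, so the noise in the second term shrinks with it. That is why refining from a coarse precision \(\epsilon'\) to \(\epsilon\) in a second stage can be cheaper than learning to precision \(\epsilon\) from a zero initialization.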