Abstract: We consider the problem of offline reinforcement learning (RL) -- a well-motivated setting of RL that aims at policy optimization using only historical data. Despite its wide applicability, theoretical understandings of offline RL, such as its optimal sample complexity, remain largely open even in basic settings such as \emph{tabular} Markov Decision Processes (MDPs). In this paper, we propose Off-Policy Double Variance Reduction (OPDVR), a new variance reduction based algorithm for offline RL. Our main result shows that OPDVR provably identifies an $\epsilon$-optimal policy with $\widetilde{O}(H^2/d_m\epsilon^2)$ episodes of offline data in the finite-horizon stationary transition setting, where $H$ is the horizon length and $d_m$ is the minimal marginal state-action distribution induced by the behavior policy. This improves over the best known upper bound by a factor of $H$. Moreover, we establish an information-theoretic lower bound of $\Omega(H^2/d_m\epsilon^2)$ which certifies that OPDVR is optimal up to logarithmic factors. Lastly, we show that OPDVR also achieves rate-optimal sample complexity under alternative settings such as the finite-horizon MDPs with non-stationary transitions and the infinite horizon MDPs with discounted rewards.

What problem does this paper attempt to address?

### The problem the paper attempts to solve

The paper "Near-Optimal Offline Reinforcement Learning via Double Variance Reduction" addresses a central question in offline reinforcement learning (offline RL): how to find a near-optimal policy using only historical data. Although offline RL is useful in many practical applications, such as robotics, education, autonomous driving, and healthcare, its theoretical understanding remains limited, especially regarding sample complexity.

### Main contributions

1. **A new algorithm**: The authors propose Off-Policy Double Variance Reduction (OPDVR), a variance-reduction-based algorithm. It finds a near-optimal policy from a static offline dataset by performing stochastic (mini-batch-style) value iteration on the offline data.
2. **Improved sample complexity**: In the finite-horizon setting with stationary transitions, the authors prove that OPDVR finds an $\epsilon$-optimal policy with high probability using $\tilde{O}(H^2 / d_m\epsilon^2)$ episodes of offline data, where $H$ is the horizon length and $d_m$ is the minimal marginal state-action distribution induced by the behavior policy in the given Markov decision process (MDP). This improves on the best known upper bound by a factor of $H$.
3. **Matching lower bound**: The authors establish an information-theoretic sample complexity lower bound of $\Omega(H^2 / d_m\epsilon^2)$ for finite-horizon offline RL, which shows that OPDVR is optimal up to logarithmic factors.
4. **Other settings**: The authors also analyze OPDVR in the finite-horizon setting with non-stationary transitions and in the infinite-horizon discounted setting, obtaining sample complexities of $\tilde{O}(H^3 / d_m\epsilon^2)$ and $\tilde{O}((1 - \gamma)^{-3}/d_m\epsilon^2)$ respectively. Both are optimal up to logarithmic factors.
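To make the rates above concrete, here is a minimal Python sketch that evaluates their leading-order terms. It reads $H^2 / d_m\epsilon^2$ as $H^2 / (d_m \epsilon^2)$ and drops all constants and logarithmic factors, so the outputs indicate scaling only, not actual dataset sizes. The function names are illustrative, and the prior bound $H^3 / (d_m \epsilon^2)$ is inferred from the stated factor-of-$H$ improvement rather than quoted from the paper.

```python
import math

def stationary_rate(H, d_m, eps):
    """Leading-order term H^2 / (d_m * eps^2) for stationary transitions."""
    return H**2 / (d_m * eps**2)

def prior_stationary_rate(H, d_m, eps):
    """Inferred earlier bound: one extra factor of H (assumption of this sketch)."""
    return H * stationary_rate(H, d_m, eps)

def nonstationary_rate(H, d_m, eps):
    """Leading-order term H^3 / (d_m * eps^2) for non-stationary transitions."""
    return H**3 / (d_m * eps**2)

def discounted_rate(gamma, d_m, eps):
    """Leading-order term (1 - gamma)^-3 / (d_m * eps^2) for discounted rewards."""
    return (1.0 - gamma) ** -3 / (d_m * eps**2)

# Hypothetical problem sizes, chosen only for illustration.
H, d_m, eps = 20, 0.01, 0.1
improvement = prior_stationary_rate(H, d_m, eps) / stationary_rate(H, d_m, eps)
print("stationary:", stationary_rate(H, d_m, eps))
print("non-stationary:", nonstationary_rate(H, d_m, eps))
print("improvement over inferred prior bound:", improvement)
```

Halving $\epsilon$ multiplies every rate by four, while halving $d_m$ (a behavior policy that visits its rarest state-action pair half as often) doubles it.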
### Technical details

- **Variance reduction techniques**: The authors extend variance reduction techniques and remove the dependence on the initial optimization gap $u(0)$ through a two-stage variance reduction scheme. The technique had previously been applied in the generative-model setting; here it is adapted and improved for the offline setting.
- **Estimator design**: The authors design two estimators, $z_t$ and $g_t$, for the two key terms in the Bellman backup. The estimators use independent batches of data from the offline dataset, avoid over-optimism through lower confidence bound (LCB) updates, and limit excessive pessimism through a max operation.
- **Initialization dependence**: A standard initialization such as $V_t^{(0)}=0$ leaves an overly large initial optimization gap. The authors handle this with a two-stage doubling process: the first stage learns to a coarse intermediate accuracy $\epsilon'=\sqrt{H}\epsilon$, and the second stage reduces the error from $\epsilon'$ to $\epsilon$.

### Conclusion

By proposing the OPDVR algorithm, this paper makes important theoretical progress in offline reinforcement learning, especially regarding sample complexity. The results are theoretically significant and also provide new tools and methods for practical applications.
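The estimator design described under Technical details can be illustrated with a minimal Python sketch of a single variance-reduced backup with a lower confidence bound. This is not the authors' implementation. It assumes a tabular MDP with nonnegative rewards, transitions stored as `(s, a, s_next)` tuples, and a simple Hoeffding-style penalty with a hypothetical constant `c`; the paper's actual confidence terms may differ. All function and variable names are illustrative.

```python
import math

def batch_mean(batch, values, n_states, n_actions):
    """Average values[s_next] over the transitions observed at each (s, a)."""
    total = [[0.0] * n_actions for _ in range(n_states)]
    count = [[0] * n_actions for _ in range(n_states)]
    for s, a, s_next in batch:
        total[s][a] += values[s_next]
        count[s][a] += 1
    mean = [[total[s][a] / count[s][a] if count[s][a] else 0.0
             for a in range(n_actions)] for s in range(n_states)]
    return mean, count

def vr_lcb_backup(ref_batch, inner_batch, reward, v_ref, v_next, v_prev, c=1.0):
    """One backup: r + z + g minus penalties, then max with the previous value."""
    n_states, n_actions = len(reward), len(reward[0])
    diff = [vn - vr for vn, vr in zip(v_next, v_ref)]
    # z estimates E[v_ref(s')] from the reference batch.
    z, n_z = batch_mean(ref_batch, v_ref, n_states, n_actions)
    # g estimates E[v_next(s') - v_ref(s')] from a separate batch.
    g, n_g = batch_mean(inner_batch, diff, n_states, n_actions)
    ref_scale = max(v_ref) - min(v_ref)
    diff_scale = max(abs(d) for d in diff)
    v_new = []
    for s in range(n_states):
        best = 0.0  # lowest possible value when rewards are nonnegative
        for a in range(n_actions):
            if n_z[s][a] == 0 or n_g[s][a] == 0:
                continue  # pairs missing from either batch keep the value 0
            # Hoeffding-style penalties (an assumption of this sketch).
            b_z = c * ref_scale / math.sqrt(n_z[s][a])
            b_g = c * diff_scale / math.sqrt(n_g[s][a])
            q_lcb = reward[s][a] + (z[s][a] - b_z) + (g[s][a] - b_g)
            best = max(best, q_lcb)
        # The max with the previous iterate keeps the value from decreasing.
        v_new.append(max(v_prev[s], best))
    return v_new

# Toy deterministic MDP: taking action a always leads to state a.
reward = [[0.0, 1.0], [0.5, 0.0]]
batch = [(0, 0, 0), (0, 1, 1), (1, 0, 0), (1, 1, 1)]
v_ref, v_next, v_prev = [0.5, 1.5], [1.0, 2.0], [0.0, 0.0]
# The toy reuses one batch for both estimators; the text calls for independent batches.
v_exact = vr_lcb_backup(batch, batch, reward, v_ref, v_next, v_prev, c=0.0)
v_pess = vr_lcb_backup(batch, batch, reward, v_ref, v_next, v_prev, c=1.0)
print(v_exact, v_pess)
```

With `c=0.0` the sketch reduces to the plain empirical Bellman backup. With `c>0` each value is pushed down by the penalties, but never below the previous iterate `v_prev`. The penalty on `g` scales with the size of `v_next - v_ref`, so it shrinks as the reference value approaches the current one, which is the intuition behind variance reduction.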

Near-Optimal Offline Reinforcement Learning via Double Variance Reduction

Beyond Reward: Offline Preference-guided Policy Optimization

Offline Primal-Dual Reinforcement Learning for Linear MDPs

Offline Policy Optimization in RL with Variance Regularizaton

Nearly Horizon-Free Offline Reinforcement Learning

Settling the Sample Complexity of Model-Based Offline Reinforcement Learning

Nearly Minimax Optimal Offline Reinforcement Learning with Linear Function Approximation: Single-Agent MDP and Markov Game

A Primal-Dual Algorithm for Offline Constrained Reinforcement Learning with Linear MDPs

Towards Instance-Optimal Offline Reinforcement Learning with Pessimism

Robust Offline Reinforcement Learning for Non-Markovian Decision Processes

Double Pessimism is Provably Efficient for Distributionally Robust Offline Reinforcement Learning: Generic Algorithm and Robust Partial Coverage

Offline Policy Evaluation for Reinforcement Learning with Adaptively Collected Data

Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning

Offline Reinforcement Learning via Linear-Programming with Error-Bound Induced Constraints

Pessimism Meets Risk: Risk-Sensitive Offline Reinforcement Learning

Sample Complexity of Offline Distributionally Robust Linear Markov Decision Processes

Efficient Online Reinforcement Learning with Offline Data

Is Pessimism Provably Efficient for Offline Reinforcement Learning?

Achieving the Asymptotically Optimal Sample Complexity of Offline Reinforcement Learning: A DRO-Based Approach

Variance-Reduced Off-Policy Memory-Efficient Policy Search

What are the Statistical Limits of Offline RL with Linear Function Approximation?