Abstract: We provide faster randomized algorithms for computing an $\epsilon$-optimal policy in a discounted Markov decision process with $A_{\text{tot}}$-state-action pairs, bounded rewards, and discount factor $\gamma$. We provide an $\tilde{O}(A_{\text{tot}}[(1 - \gamma)^{-3}\epsilon^{-2} + (1 - \gamma)^{-2}])$-time algorithm in the sampling setting, where the probability transition matrix is unknown but accessible through a generative model which can be queried in $\tilde{O}(1)$-time, and an $\tilde{O}(s + A_{\text{tot}}(1-\gamma)^{-2})$-time algorithm in the offline setting where the probability transition matrix is known and $s$-sparse. These results improve upon the prior state-of-the-art which either ran in $\tilde{O}(A_{\text{tot}}[(1 - \gamma)^{-3}\epsilon^{-2} + (1 - \gamma)^{-3}])$ time [Sidford, Wang, Wu, Ye 2018] in the sampling setting, $\tilde{O}(s + A_{\text{tot}} (1-\gamma)^{-3})$ time [Sidford, Wang, Wu, Yang, Ye 2018] in the offline setting, or time at least quadratic in the number of states using interior point methods for linear programming. We achieve our results by building upon prior stochastic variance-reduced value iteration methods [Sidford, Wang, Wu, Yang, Ye 2018]. We provide a variant that carefully truncates the progress of its iterates to improve the variance of new variance-reduced sampling procedures that we introduce to implement the steps. Our method is essentially model-free and can be implemented in $\tilde{O}(A_{\text{tot}})$-space when given generative model access. Consequently, our results take a step in closing the sample-complexity gap between model-free and model-based methods.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses the problem of computing a near-optimal policy in a discounted Markov decision process (DMDP). Specifically, the authors propose faster randomized algorithms for computing an $\epsilon$-optimal policy in a DMDP with $A_{\text{tot}}$ state-action pairs, bounded rewards, and discount factor $\gamma$.

### Main contributions

1. **Algorithm in the sampling setting**:
   - The paper gives an algorithm with running time $\tilde{O}(A_{\text{tot}}[(1 - \gamma)^{-3}\epsilon^{-2}+(1 - \gamma)^{-2}])$ in the sampling setting, where the probability transition matrix is unknown but can be queried through a generative model.
   - This improves on the previous best bound of $\tilde{O}(A_{\text{tot}}[(1 - \gamma)^{-3}\epsilon^{-2}+(1 - \gamma)^{-3}])$.
2. **Algorithm in the offline setting**:
   - The paper gives an algorithm with running time $\tilde{O}(s + A_{\text{tot}}(1 - \gamma)^{-2})$ in the offline setting, where the probability transition matrix is known and $s$-sparse.
   - This improves on the previous best bound of $\tilde{O}(s + A_{\text{tot}}(1 - \gamma)^{-3})$.
3. **Method improvement**:
   - The algorithm builds on prior stochastic variance-reduced value iteration methods. It carefully truncates the progress of its iterates, which lowers the variance of the new variance-reduced sampling procedures used to implement each step.
   - The method is essentially model-free and can be implemented in $\tilde{O}(A_{\text{tot}})$ space when given generative model access.

### Technical details

1. **Value iteration**:
   - Classical value iteration approximates the optimal value function $v^*$ by repeatedly applying the Bellman operator $T$. At each iteration, the value function $v^{(t)}$ is updated as:
     \[ v^{(t)}(s)\leftarrow\max_{a\in A_s}(r_a(s)+\gamma p_a(s)^{\top}v^{(t - 1)}) \]
   - The time complexity of this method in the offline setting is $\tilde{O}(\text{nnz}(P)(1 - \gamma)^{-1})$.
2. **Stochastic value iteration and variance reduction**:
   - Stochastic value iteration carries out value iteration in the sampling setting by replacing the expected utility $p_a(s)^{\top}v$ with an estimate computed from samples.
   - Variance reduction techniques lower the sampling error of the estimate of $p_a(s)^{\top}v$ for each state-action pair.
3. **Recursive variance reduction**:
   - Recursive variance reduction estimates the change $\Delta^{(t)}\approx P(v^{(t)}-v^{(t - 1)})$ at each iteration and maintains the cumulative sum $g^{(t)}$ of these estimates.
   - The specific update formula is:
     \[ g^{(t)}\leftarrow g^{(t - 1)}+\Delta^{(t)} \]
4. **Truncated value iteration**:
   - To reduce the variance further, the authors propose truncated value iteration, which limits how much the value function can change in each iteration so that the variance of each step stays bounded.
   - The specific update formula is:
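The following Python sketch illustrates only items 1 and 3 of the technical details: exact value iteration, and a sampled variant that tracks $p_a(s)^{\top}v$ through the running sum $g^{(t)}\leftarrow g^{(t-1)}+\Delta^{(t)}$. It is not the paper's algorithm: it omits the truncation step and uses a fixed batch size rather than any sample-size schedule. The names `TOY_MDP`, `sample_next_state`, and the batch size `m` are hypothetical choices made for this illustration.

```python
import random

# Hypothetical toy DMDP (an assumption, not from the paper):
# TOY_MDP[s] lists (reward, next-state distribution) pairs, one per action.
TOY_MDP = [
    [(0.0, [0.5, 0.5]), (0.2, [0.9, 0.1])],  # state 0 has two actions
    [(1.0, [0.1, 0.9])],                     # state 1 has one action
]
GAMMA = 0.9


def exact_value_iteration(mdp, gamma, iters):
    """Repeatedly apply v(s) <- max_a (r_a(s) + gamma * p_a(s)^T v)."""
    v = [0.0] * len(mdp)
    for _ in range(iters):
        v = [
            max(r + gamma * sum(pi * vi for pi, vi in zip(p, v)) for r, p in actions)
            for actions in mdp
        ]
    return v


def sample_next_state(p, rng):
    """Stand-in for one generative-model query: draw s' from distribution p."""
    return rng.choices(range(len(p)), weights=p)[0]


def sampled_value_iteration(mdp, gamma, iters, m, rng):
    """Value iteration where g[s][a] tracks p_a(s)^T v via sampled changes."""
    n = len(mdp)
    v = [0.0] * n
    # g^{(0)} equals P v^{(0)} exactly because v^{(0)} is the zero vector.
    g = [[0.0 for _ in actions] for actions in mdp]
    for _ in range(iters):
        # Bellman update using the current estimate g in place of P v.
        v_new = [
            max(r + gamma * g[s][a] for a, (r, _) in enumerate(mdp[s]))
            for s in range(n)
        ]
        diff = [x - y for x, y in zip(v_new, v)]
        for s in range(n):
            for a, (_, p) in enumerate(mdp[s]):
                # Delta^{(t)}: sample mean estimating p_a(s)^T (v^{(t)} - v^{(t-1)}).
                delta = sum(diff[sample_next_state(p, rng)] for _ in range(m)) / m
                g[s][a] += delta  # g^{(t)} <- g^{(t-1)} + Delta^{(t)}
        v = v_new
    return v


if __name__ == "__main__":
    print("exact:  ", exact_value_iteration(TOY_MDP, GAMMA, 200))
    print("sampled:", sampled_value_iteration(TOY_MDP, GAMMA, 200, 500, random.Random(0)))
```

Because each $\Delta^{(t)}$ is a sample mean of entries of $v^{(t)}-v^{(t-1)}$, its error scales with the size of that difference. This is one way to see why limiting the per-iteration change, as truncation does, can help control the variance.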

Truncated Variance Reduced Value Iteration
