Truncated Variance Reduced Value Iteration

Yujia Jin, Ishani Karmarkar, Aaron Sidford, Jiayi Wang
2024-05-22
Abstract: We provide faster randomized algorithms for computing an $\epsilon$-optimal policy in a discounted Markov decision process with $A_{\text{tot}}$-state-action pairs, bounded rewards, and discount factor $\gamma$. We provide an $\tilde{O}(A_{\text{tot}}[(1 - \gamma)^{-3}\epsilon^{-2} + (1 - \gamma)^{-2}])$-time algorithm in the sampling setting, where the probability transition matrix is unknown but accessible through a generative model which can be queried in $\tilde{O}(1)$-time, and an $\tilde{O}(s + A_{\text{tot}} (1-\gamma)^{-2})$-time algorithm in the offline setting where the probability transition matrix is known and $s$-sparse. These results improve upon the prior state-of-the-art which either ran in $\tilde{O}(A_{\text{tot}}[(1 - \gamma)^{-3}\epsilon^{-2} + (1 - \gamma)^{-3}])$ time [Sidford, Wang, Wu, Ye 2018] in the sampling setting, $\tilde{O}(s + A_{\text{tot}} (1-\gamma)^{-3})$ time [Sidford, Wang, Wu, Yang, Ye 2018] in the offline setting, or time at least quadratic in the number of states using interior point methods for linear programming. We achieve our results by building upon prior stochastic variance-reduced value iteration methods [Sidford, Wang, Wu, Yang, Ye 2018]. We provide a variant that carefully truncates the progress of its iterates to improve the variance of new variance-reduced sampling procedures that we introduce to implement the steps. Our method is essentially model-free and can be implemented in $\tilde{O}(A_{\text{tot}})$-space when given generative model access. Consequently, our results take a step in closing the sample-complexity gap between model-free and model-based methods.
Machine Learning, Data Structures and Algorithms, Optimization and Control
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses the problem of computing near-optimal policies in discounted Markov decision processes (DMDPs). Specifically, the authors propose faster randomized algorithms for computing an \(\epsilon\)-optimal policy in a DMDP with \(A_{\text{tot}}\) state-action pairs, bounded rewards, and discount factor \(\gamma\).

### Main contributions

1. **Algorithm in the sampling setting**:
   - The paper gives an algorithm with running time \(\tilde{O}(A_{\text{tot}}[(1 - \gamma)^{-3}\epsilon^{-2} + (1 - \gamma)^{-2}])\) in the sampling setting, where the probability transition matrix is unknown but can be queried through a generative model.
   - This improves on the previous best running time of \(\tilde{O}(A_{\text{tot}}[(1 - \gamma)^{-3}\epsilon^{-2} + (1 - \gamma)^{-3}])\).
2. **Algorithm in the offline setting**:
   - The paper gives an algorithm with running time \(\tilde{O}(s + A_{\text{tot}}(1 - \gamma)^{-2})\) in the offline setting, where the probability transition matrix is known and \(s\)-sparse.
   - This improves on the previous best running time of \(\tilde{O}(s + A_{\text{tot}}(1 - \gamma)^{-3})\).
3. **Method improvement**:
   - The algorithm builds on prior stochastic variance-reduced value iteration methods. It carefully truncates the progress of its iterates, which lowers the variance of the new variance-reduced sampling procedures used to implement each step.
   - The method is essentially model-free and can be implemented in \(\tilde{O}(A_{\text{tot}})\) space when given access to the generative model.

### Technical details

1. **Value iteration**:
   - Classical value iteration approximates the optimal value function \(v^*\) by repeatedly applying the Bellman operator \(T\). At each iteration, the value function \(v^{(t)}\) is updated as:
     \[ v^{(t)}(s) \leftarrow \max_{a \in A_s} \left( r_a(s) + \gamma\, p_a(s)^{\top} v^{(t-1)} \right) \]
   - In the offline setting, this method runs in time \(\tilde{O}(\text{nnz}(P)(1 - \gamma)^{-1})\).
2. **Stochastic value iteration and variance reduction**:
   - Stochastic value iteration carries value iteration over to the sampling setting by replacing the expected utility \(p_a(s)^{\top} v\) with a stochastic estimate.
   - Variance-reduction techniques lower the variance of these estimates, so that \(p_a(s)^{\top} v\) is approximated more precisely for each state-action pair.
3. **Recursive variance reduction**:
   - Recursive variance reduction estimates the per-step change \(\Delta^{(t)} \approx P(v^{(t)} - v^{(t-1)})\) and maintains the cumulative sum \(g^{(t)}\) of these estimates.
   - The specific update formula is:
     \[ g^{(t)} \leftarrow g^{(t-1)} + \Delta^{(t)} \]
4. **Truncated value iteration**:
   - To further reduce the variance, the authors propose truncated value iteration, which limits how much the value function can change in each iteration and thereby keeps the variance of each step from growing too large.
   - The specific update formula is: