Abstract: We provide faster randomized algorithms for computing an $\epsilon$-optimal policy in a discounted Markov decision process with $A_{\text{tot}}$ state-action pairs, bounded rewards, and discount factor $\gamma$. We provide an $\tilde{O}(A_{\text{tot}}[(1 - \gamma)^{-3}\epsilon^{-2} + (1 - \gamma)^{-2}])$-time algorithm in the sampling setting, where the probability transition matrix is unknown but accessible through a generative model which can be queried in $\tilde{O}(1)$-time, and an $\tilde{O}(s + (1-\gamma)^{-2})$-time algorithm in the offline setting, where the probability transition matrix is known and $s$-sparse. These results improve upon the prior state of the art, which ran in either $\tilde{O}(A_{\text{tot}}[(1 - \gamma)^{-3}\epsilon^{-2} + (1 - \gamma)^{-3}])$ time [Sidford, Wang, Wu, Ye 2018] in the sampling setting, $\tilde{O}(s + A_{\text{tot}} (1-\gamma)^{-3})$ time [Sidford, Wang, Wu, Yang, Ye 2018] in the offline setting, or time at least quadratic in the number of states using interior point methods for linear programming. We achieve our results by building upon prior stochastic variance-reduced value iteration methods [Sidford, Wang, Wu, Yang, Ye 2018]. We provide a variant that carefully truncates the progress of its iterates to improve the variance of new variance-reduced sampling procedures that we introduce to implement the steps. Our method is essentially model-free and can be implemented in $\tilde{O}(A_{\text{tot}})$-space when given generative model access. Consequently, our results take a step toward closing the sample-complexity gap between model-free and model-based methods.

Asynchronous value iteration for Markov decision processes with continuous state spaces

Incremental Value Iteration for Time-Aggregated Markov-Decision Processes

A Q-learning algorithm for Markov decision processes with continuous state spaces

An Incremental Sampling-based Algorithm for Stochastic Optimal Control

Relative Q-Learning for Average-Reward Markov Decision Processes with Continuous States

Truncated Variance Reduced Value Iteration

An Accelerated Fitted Value Iteration Algorithm for MDPs with Finite and Vector-Valued Action Space

A policy iteration algorithm for non-Markovian control problems

A Probabilistic Forward Search Value Iteration Algorithm for POMDP

On Value Iteration Convergence in Connected MDPs

Manifold Regularization Based Approximate Value Iteration For Learning Control

A Probabilistic Greedy Search Value Iteration Algorithm for POMDP

An Improved Method for Approximating the Infinite-Horizon Value Function of the Discrete-Time Switched LQR Problem

A Hybrid Heuristic Value Iteration Algorithm for POMDP

Markov Decision Processes with Time-Varying Geometric Discounting

Infinite-Horizon Policy-Gradient Estimation with Variable Discount Factor for Markov Decision Process

Value-Gradient Iteration with Quadratic Approximate Value Functions

Intermittently Observable Markov Decision Processes

A Multi-Criteria Value Iteration Algorithm for POMDP Problems

Simple method for efficiently solving dynamic models with continuous actions using policy gradient

Value Iteration Adaptive Dynamic Programming for Optimal Control of Discrete-Time Nonlinear Systems