Abstract: We consider the infinite-horizon discounted optimal control problem formalized by Markov Decision Processes. We focus on several approximate variations of the Policy Iteration algorithm: Approximate Policy Iteration (API), Conservative Policy Iteration (CPI), a natural adaptation of the Policy Search by Dynamic Programming algorithm to the infinite-horizon case (PSDP$_\infty$), and the recently proposed Non-Stationary Policy Iteration (NSPI(m)). For all algorithms, we describe performance bounds and compare them, paying particular attention to the concentrability constants involved, the number of iterations, and the memory required. Our analysis highlights the following points: 1) The performance guarantee of CPI can be arbitrarily better than that of API/API($\alpha$), but this comes at the cost of a relative increase, exponential in $\frac{1}{\epsilon}$, of the number of iterations. 2) PSDP$_\infty$ enjoys the best of both worlds: its performance guarantee is similar to that of CPI, but within a number of iterations similar to that of API. 3) Contrary to API, which requires constant memory, the memory needed by CPI and PSDP$_\infty$ is proportional to their number of iterations, which may be problematic when the discount factor $\gamma$ is close to 1 or the approximation error $\epsilon$ is close to $0$; we show that the NSPI(m) algorithm allows an overall trade-off between memory and performance. Simulations with these schemes confirm our analysis.
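To make the difference between a full greedy step and a conservative step concrete, the following is a minimal sketch, not the paper's experimental setup. It assumes a hypothetical 2-state, 2-action MDP, exact policy evaluation (so there is no approximation error $\epsilon$), and a fixed mixing rate `alpha`. With `alpha = 1` the update is plain policy iteration; with `alpha < 1` the new policy is a stochastic mixture of the old policy and the greedy one, in the spirit of API($\alpha$) and CPI. All names (`P`, `R`, `evaluate`, `greedy`, `mix`, `run`) are illustrative.

```python
GAMMA = 0.9  # discount factor (assumed value for this toy example)

# Hypothetical MDP, not taken from the paper.
# P[s][a] lists next-state probabilities; R[s][a] is the immediate reward.
# Action 0 always leads to state 0, action 1 always leads to state 1.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[1.0, 0.0], [0.0, 1.0]]]
R = [[0.0, 0.0],
     [0.0, 1.0]]


def q_value(v, s, a):
    """One-step lookahead value of taking action a in state s."""
    return R[s][a] + GAMMA * sum(p * v[t] for t, p in enumerate(P[s][a]))


def evaluate(policy, sweeps=500):
    """Evaluate a stochastic policy by repeated Bellman backups.

    policy[s][a] is the probability of taking action a in state s.
    """
    v = [0.0] * len(P)
    for _ in range(sweeps):
        v = [sum(policy[s][a] * q_value(v, s, a) for a in range(len(P[s])))
             for s in range(len(P))]
    return v


def greedy(v):
    """Greedy policy with respect to v, encoded as one-hot distributions."""
    policy = []
    for s in range(len(P)):
        qs = [q_value(v, s, a) for a in range(len(P[s]))]
        best = qs.index(max(qs))  # ties are broken toward the lowest index
        policy.append([1.0 if a == best else 0.0 for a in range(len(qs))])
    return policy


def mix(old, new, alpha):
    """Conservative update: (1 - alpha) * old + alpha * new, per state."""
    return [[(1 - alpha) * o + alpha * n for o, n in zip(row_o, row_n)]
            for row_o, row_n in zip(old, new)]


def run(alpha, iterations):
    """Run the (conservative) policy iteration loop and return the final value."""
    policy = [[1.0, 0.0], [1.0, 0.0]]  # start by always taking action 0
    for _ in range(iterations):
        policy = mix(policy, greedy(evaluate(policy)), alpha)
    return evaluate(policy)


if __name__ == "__main__":
    print("alpha=1.0, 3 iterations:  ", run(1.0, 3))
    print("alpha=0.1, 3 iterations:  ", run(0.1, 3))
    print("alpha=0.1, 200 iterations:", run(0.1, 200))
```

In this toy setting the full greedy step reaches the optimal value within a few iterations, while the conservative step needs many more iterations to get close. Because evaluation here is exact, the sketch only illustrates the iteration-count side of the comparison; it cannot reproduce the improved guarantee that the abstract attributes to CPI under approximation errors.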

Discrete-time Generalized Policy Iteration ADP Algorithm with Approximation Errors

Discrete-Time Stable Generalized Self-Learning Optimal Control With Approximation Errors

Discrete-Time Nonlinear Generalized Policy Iteration for Optimal Control Using Neural Networks

Error bound analysis of policy iteration based approximate dynamic programming for deterministic discrete-time nonlinear systems

Policy Approximation in Policy Iteration Approximate Dynamic Programming for Discrete-Time Nonlinear Systems

Approximate Modified Policy Iteration

Finite-approximation-error-based Discrete-Time Iterative Adaptive Dynamic Programming

Modified general policy iteration based adaptive dynamic programming for unknown discrete‐time linear systems

Approximate Policy Iteration Schemes: A Comparison

Policy Iteration Approximate Dynamic Programming Using Volterra Series Based Actor

Hamiltonian-Driven Adaptive Dynamic Programming With Approximation Errors

Parametric Approximation Policy Iteration Algorithm Based on Gaussian Process

Local Policy Iteration Adaptive Dynamic Programming for Discrete-Time Nonlinear Systems

Efficient approximate dynamic programming based on design and analysis of computer experiments for infinite-horizon optimization

Error Bound Analysis of Q-Function for Discounted Optimal Control Problems With Policy Iteration

Nonparametric approximation generalized policy iteration reinforcement learning algorithm based on states clustering

Infinite Horizon Self-Learning Optimal Control of Nonaffine Discrete-Time Nonlinear Systems

Discrete-Time Optimal Control Via Local Policy Iteration Adaptive Dynamic Programming

Twin Deterministic Policy Gradient Adaptive Dynamic Programming for Optimal Control of Affine Nonlinear Discrete-time Systems

Revisiting approximate dynamic programming and its convergence

Theoretical and Numerical Analysis of Approximate Dynamic Programming with Approximation Errors