On Value Iteration Convergence in Connected MDPs

Arsenii Mustafin,Alex Olshevsky,Ioannis Ch. Paschalidis

2024-06-14

Abstract:This paper establishes that an MDP with a unique optimal policy and ergodic associated transition matrix ensures the convergence of various versions of the Value Iteration algorithm at a geometric rate that exceeds the discount factor $\gamma$ for both discounted and average-reward criteria.
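As a rough illustration of the convergence rates this abstract discusses, the sketch below runs synchronous value iteration on a small random discounted MDP and records two error measures per iteration: the sup-norm error and the span seminorm error (max minus min of the error vector). The random MDP, the helper names (`bellman_optimality`, `random_mdp`, `value_iteration_errors`), and the choice of the span seminorm as a second measure are assumptions made for this example; they are not taken from the paper.

```python
import numpy as np

def bellman_optimality(V, P, R, gamma):
    """One synchronous Bellman optimality backup.

    P has shape (A, S, S) with P[a, s, t] = Pr(t | s, a); R has shape (S, A).
    """
    # Q[s, a] = R[s, a] + gamma * sum_t P[a, s, t] * V[t]
    Q = R + gamma * np.einsum("ast,t->sa", P, V)
    return Q.max(axis=1)

def random_mdp(n_states, n_actions, seed=0):
    """Hypothetical dense random MDP: every transition has positive probability."""
    rng = np.random.default_rng(seed)
    P = rng.random((n_actions, n_states, n_states))
    P /= P.sum(axis=2, keepdims=True)  # make each row a probability distribution
    R = rng.random((n_states, n_actions))
    return P, R

def value_iteration_errors(P, R, gamma, n_iters=60, n_ref=5000):
    """Return per-iteration sup-norm and span errors relative to a reference V*."""
    # Reference solution: run value iteration long enough to reach machine precision.
    V_star = np.zeros(R.shape[0])
    for _ in range(n_ref):
        V_star = bellman_optimality(V_star, P, R, gamma)

    V = np.zeros(R.shape[0])
    sup_errors, span_errors = [], []
    for _ in range(n_iters):
        diff = V - V_star
        sup_errors.append(np.max(np.abs(diff)))
        span_errors.append(diff.max() - diff.min())
        V = bellman_optimality(V, P, R, gamma)
    return np.array(sup_errors), np.array(span_errors)

gamma = 0.9
P, R = random_mdp(n_states=6, n_actions=3)
errors, span_errors = value_iteration_errors(P, R, gamma)

# Successive sup-norm error ratios; the contraction property bounds them by gamma.
ratios = errors[1:] / errors[:-1]
```

On this instance the sup-norm ratios stay at or below `gamma`, as the classical contraction argument guarantees, while the span error shrinks considerably faster. Whether that matches the precise sense in which the paper's rate "exceeds" $\gamma$ cannot be determined from the abstract alone.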

Machine Learning

Convergence Analysis of an Incremental Approach to Online Inverse Reinforcement Learning

Zhuo-jun Jin,Hui Qian,Shen-yi Chen,Miao-liang Zhu

DOI: https://doi.org/10.1631/jzus.c1010010

2011-01-01

Abstract:Interest in inverse reinforcement learning (IRL) has recently increased, that is, interest in the problem of recovering the reward function underlying a Markov decision process (MDP) given the dynamics of the system and the behavior of an expert. This paper deals with an incremental approach to online IRL. First, the convergence property of the incremental method for the IRL problem was investigated, and the bounds of both the mistake number during the learning process and regret were provided by using a detailed proof. Then an online algorithm based on incremental error correcting was derived to deal with the IRL problem. The key idea is to add an increment to the current reward estimate each time an action mismatch occurs. This leads to an estimate that approaches a target optimal value. The proposed method was tested in a driving simulation experiment and found to be able to efficiently recover an adequate reward function.
On The Convergence Of Policy Iteration-Based Reinforcement Learning With Monte Carlo Policy Evaluation

Anna Winnicki,R. Srikant

DOI: https://doi.org/10.48550/arXiv.2301.09709

2023-02-28

Abstract:A common technique in reinforcement learning is to evaluate the value function from Monte Carlo simulations of a given policy, and use the estimated value function to obtain a new policy which is greedy with respect to the estimated value function. A well-known longstanding open problem in this context is to prove the convergence of such a scheme when the value function of a policy is estimated from data collected from a single sample path obtained from implementing the policy (see page 99 of [Sutton and Barto, 2018], page 8 of [Tsitsiklis, 2002]). We present a solution to the open problem by showing that a first-visit version of such a policy iteration scheme indeed converges to the optimal policy provided that the policy improvement step uses lookahead [Silver et al., 2016, Mnih et al., 2016, Silver et al., 2017b] rather than a simple greedy policy improvement. We provide results both for the original open problem in the tabular setting and also present extensions to the function approximation setting, where we show that the policy resulting from the algorithm performs close to the optimal policy within a function approximation error.

Machine Learning,Artificial Intelligence,Systems and Control
Asynchronous value iteration for Markov decision processes with continuous state spaces

Xiangyu Yang,Jian-Qiang Hu,Jiaqiao Hu,Yijie Peng

DOI: https://doi.org/10.1109/WSC48552.2020.9384120

2020-01-01

Abstract:We propose a simulation-based value iteration algorithm for approximately solving infinite horizon discounted MDPs with continuous state spaces and finite actions. At each time step, the algorithm employs the shrinking ball method to estimate the value function at sampled states and uses historical estimates in an interpolation-based fitting strategy to build an approximator of the optimal value function. Under moderate conditions, we prove that the sequence of approximators generated by the algorithm converges uniformly to the optimal value function with probability one. Simple numerical examples are provided to compare our algorithm with two other existing methods.
Incremental Value Iteration for Time-Aggregated Markov-Decision Processes

Tao Sun,Qianchuan Zhao,Peter B. Luh

DOI: https://doi.org/10.1109/TAC.2007.908359

2007-01-01

Abstract:A value iteration algorithm for time-aggregated Markov-decision processes (MDPs) is developed to solve problems with large state spaces. The algorithm is based on a novel approach which solves a time aggregated MDP by incrementally solving a set of standard MDPs. Therefore, the algorithm converges under the same assumption as standard value iteration. Such assumption is much weaker than that required by the existing time aggregated value iteration algorithm. The algorithms developed in this paper are also applicable to MDPs with fractional costs.
On the Convergence of Modified Policy Iteration in Risk Sensitive Exponential Cost Markov Decision Processes

Yashaswini Murthy,Mehrdad Moharrami,R. Srikant

2024-02-15

Abstract:Modified policy iteration (MPI) is a dynamic programming algorithm that combines elements of policy iteration and value iteration. The convergence of MPI has been well studied in the context of discounted and average-cost MDPs. In this work, we consider the exponential cost risk-sensitive MDP formulation, which is known to provide some robustness to model parameters. Although policy iteration and value iteration have been well studied in the context of risk sensitive MDPs, MPI is unexplored. We provide the first proof that MPI also converges for the risk-sensitive problem in the case of finite state and action spaces. Since the exponential cost formulation deals with the multiplicative Bellman equation, our main contribution is a convergence proof which is quite different than existing results for discounted and risk-neutral average-cost problems as well as risk sensitive value and policy iteration approaches. We conclude our analysis with simulation results, assessing MPI's performance relative to alternative dynamic programming methods like value iteration and policy iteration across diverse problem parameters. Our findings highlight risk-sensitive MPI's enhanced computational efficiency compared to both value and policy iteration techniques.

Machine Learning,Artificial Intelligence,Systems and Control
Deflated Dynamics Value Iteration

Jongmin Lee,Amin Rakhsha,Ernest K. Ryu,Amir-massoud Farahmand

2024-07-15

Abstract:The Value Iteration (VI) algorithm is an iterative procedure to compute the value function of a Markov decision process, and is the basis of many reinforcement learning (RL) algorithms as well. As the error convergence rate of VI as a function of iteration $k$ is $O(\gamma^k)$, it is slow when the discount factor $\gamma$ is close to $1$. To accelerate the computation of the value function, we propose Deflated Dynamics Value Iteration (DDVI). DDVI uses matrix splitting and matrix deflation techniques to effectively remove (deflate) the top $s$ dominant eigen-structure of the transition matrix $\mathcal{P}^{\pi}$. We prove that this leads to a $\tilde{O}(\gamma^k |\lambda_{s+1}|^k)$ convergence rate, where $\lambda_{s+1}$ is the $(s+1)$-th largest eigenvalue of the dynamics matrix. We then extend DDVI to the RL setting and present the Deflated Dynamics Temporal Difference (DDTD) algorithm. We empirically show the effectiveness of the proposed algorithms.
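To make the deflation idea concrete, here is a minimal sketch for policy evaluation with a single deflated eigenvalue ($s = 1$). It is an illustration written for this listing under stated assumptions, not the authors' DDVI implementation: the transition matrix is a hypothetical dense random stochastic matrix, the deflated component is the rank-one matrix $E = \mathbf{1}u^\top$ built from the stationary distribution $u$, and the function names are invented for the example. Splitting $P = E + (P - E)$ gives the iteration $V_{k+1} = (I - \gamma E)^{-1}\bigl(r + \gamma (P - E) V_k\bigr)$, whose fixed point still satisfies $V = r + \gamma P V$.

```python
import numpy as np

def stationary_distribution(P):
    """Left eigenvector of P for the eigenvalue closest to 1, normalised to sum to 1."""
    eigvals, eigvecs = np.linalg.eig(P.T)
    idx = np.argmin(np.abs(eigvals - 1.0))
    u = np.real(eigvecs[:, idx])
    return u / u.sum()

def standard_policy_evaluation(P, r, gamma, n_iters):
    """Plain fixed-point iteration V <- r + gamma * P V."""
    V = np.zeros_like(r)
    for _ in range(n_iters):
        V = r + gamma * P @ V
    return V

def deflated_policy_evaluation(P, r, gamma, n_iters):
    """Rank-one deflated iteration (illustrative, s = 1 only)."""
    n = len(r)
    u = stationary_distribution(P)
    E = np.outer(np.ones(n), u)      # right eigenvector 1, left eigenvector u
    A = np.eye(n) - gamma * E        # the deflated part is handled by a linear solve
    V = np.zeros_like(r)
    for _ in range(n_iters):
        V = np.linalg.solve(A, r + gamma * (P - E) @ V)
    return V

rng = np.random.default_rng(0)
n = 5
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)    # hypothetical transition matrix of a fixed policy
r = rng.random(n)
gamma = 0.99

V_exact = np.linalg.solve(np.eye(n) - gamma * P, r)
err_standard = np.max(np.abs(standard_policy_evaluation(P, r, gamma, 50) - V_exact))
err_deflated = np.max(np.abs(deflated_policy_evaluation(P, r, gamma, 50) - V_exact))
```

Because $E(P - E) = 0$ when $u$ is stationary, the error of the deflated iteration is propagated by $\gamma (P - E)$, whose spectral radius is $\gamma |\lambda_2|$ rather than $\gamma$. After 50 iterations with $\gamma = 0.99$, the plain iteration is still far from the solution on this instance while the deflated one has essentially converged.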

Machine Learning,Optimization and Control
A policy iteration algorithm for non-Markovian control problems

Dylan Possamaï,Ludovic Tangpi

2024-09-06

Abstract:In this paper, we propose a new policy iteration algorithm to compute the value function and the optimal controls of continuous time stochastic control problems. The algorithm relies on successive approximations using linear-quadratic control problems which can all be solved explicitly, and only require to solve recursively linear PDEs in the Markovian case. Though our procedure fails in general to produce a non-decreasing sequence like the standard algorithm, it can be made arbitrarily close to being monotone. More importantly, we recover the standard exponential speed of convergence for both the value and the controls, through purely probabilistic arguments which are significantly simpler than in the classical case. Our proof also accommodates non-Markovian dynamics as well as volatility control, allowing us to obtain the first convergence results in the latter case for a state process in multi-dimensions.

Optimization and Control,Probability
On the Use of Non-Stationary Policies for Infinite-Horizon Discounted Markov Decision Processes

Bruno Scherrer

DOI: https://doi.org/10.48550/arXiv.1203.5532

2012-03-31

Abstract:We consider infinite-horizon $\gamma$-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. We consider the algorithm Value Iteration and the sequence of policies $\pi_1,...,\pi_k$ it implicitly generates until some iteration $k$. We provide performance bounds for non-stationary policies involving the last $m$ generated policies that reduce the state-of-the-art bound for the last stationary policy $\pi_k$ by a factor $\frac{1-\gamma}{1-\gamma^m}$. In particular, the use of non-stationary policies makes it possible to reduce the usual asymptotic performance bounds of Value Iteration with errors bounded by $\epsilon$ at each iteration from $\frac{\gamma}{(1-\gamma)^2}\epsilon$ to $\frac{\gamma}{1-\gamma}\epsilon$, which is significant in the usual situation when $\gamma$ is close to 1. Given Bellman operators that can only be computed with some error $\epsilon$, a surprising consequence of this result is that the problem of "computing an approximately optimal non-stationary policy" is much simpler than that of "computing an approximately optimal stationary policy", and even slightly simpler than that of "approximately computing the value of some fixed policy", since this last problem only has a guarantee of $\frac{1}{1-\gamma}\epsilon$.
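The size of the improvement is easy to underestimate, so the short sketch below simply evaluates the expressions quoted in the abstract. The function names and the sample values $\gamma = 0.99$, $\epsilon = 0.01$ are illustrative assumptions; the formulas are the ones stated above.

```python
import math

def stationary_bound(gamma, eps):
    """Bound quoted for the last stationary policy: gamma / (1 - gamma)^2 * eps."""
    return gamma / (1.0 - gamma) ** 2 * eps

def improvement_factor(gamma, m):
    """Factor (1 - gamma) / (1 - gamma^m) for a policy cycling over the last m policies."""
    return (1.0 - gamma) / (1.0 - gamma ** m)

def nonstationary_bound(gamma, eps, m):
    """Stationary bound scaled by the improvement factor."""
    return improvement_factor(gamma, m) * stationary_bound(gamma, eps)

def limiting_bound(gamma, eps):
    """Limit of the non-stationary bound as m grows: gamma / (1 - gamma) * eps."""
    return gamma / (1.0 - gamma) * eps

gamma, eps = 0.99, 0.01
bounds = {m: nonstationary_bound(gamma, eps, m) for m in (1, 10, 100, 1000)}
# With these sample values the stationary bound is roughly 99,
# while the limiting non-stationary bound is roughly 0.99.
```

With $m = 1$ the factor equals 1 and the stationary bound is recovered; as $m$ grows the factor tends to $1 - \gamma$, which is where the gap of one power of $\frac{1}{1-\gamma}$ comes from.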

Artificial Intelligence
On the Convergence of Optimal Actions for Markov Decision Processes and the Optimality of $(s,S)$ Inventory Policies

Eugene A. Feinberg,Mark E. Lewis

DOI: https://doi.org/10.48550/arXiv.1507.05125

2017-03-20

Abstract:This paper studies convergence properties of optimal values and actions for discounted and average-cost Markov Decision Processes (MDPs) with weakly continuous transition probabilities and applies these properties to the stochastic periodic-review inventory control problem with backorders, positive setup costs, and convex holding/backordering costs. The following results are established for MDPs with possibly noncompact action sets and unbounded cost functions: (i) convergence of value iterations to optimal values for discounted problems with possibly non-zero terminal costs, (ii) convergence of optimal finite-horizon actions to optimal infinite-horizon actions for total discounted costs, as the time horizon tends to infinity, and (iii) convergence of optimal discount-cost actions to optimal average-cost actions for infinite-horizon problems, as the discount factor tends to 1. Being applied to the setup-cost inventory control problem, the general results on MDPs imply the optimality of $(s,S)$ policies and convergence properties of optimal thresholds. In particular this paper analyzes the setup-cost inventory control problem without two assumptions often used in the literature: (a) the demand is either discrete or continuous or (b) the backordering cost is higher than the cost of backordered inventory if the amount of backordered inventory is large.

Optimization and Control
Revisiting approximate dynamic programming and its convergence

Ali Heydari

DOI: https://doi.org/10.1109/TCYB.2014.2314612

Abstract:Value iteration-based approximate/adaptive dynamic programming (ADP) as an approximate solution to infinite-horizon optimal control problems with deterministic dynamics and continuous state and action spaces is investigated. The learning iterations are decomposed into an outer loop and an inner loop. A relatively simple proof for the convergence of the outer-loop iterations to the optimal solution is provided using a novel idea with some new features. It presents an analogy between the value function during the iterations and the value function of a fixed-final-time optimal control problem. The inner loop is utilized to avoid the need for solving a set of nonlinear equations or a nonlinear optimization problem numerically, at each iteration of ADP for the policy update. Sufficient conditions for the uniqueness of the solution to the policy update equation and for the convergence of the inner-loop iterations to the solution are obtained. Afterwards, the results are formed as a learning algorithm for training a neurocontroller or creating a look-up table to be used for optimal control of nonlinear systems with different initial conditions. Finally, some of the features of the investigated method are numerically analyzed.
On Some Geometric Behavior of Value Iteration on the Orthant: Switching System Perspective

Donghwan Lee

2023-12-01

Abstract:In this paper, the primary goal is to offer additional insights into the value iteration through the lens of switching system models in the control community. These models establish a connection between value iteration and switching system theory and reveal additional geometric behaviors of value iteration in solving discounted Markov decision problems. Specifically, the main contributions of this paper are twofold: 1) We provide a switching system model of value iteration and, based on it, offer a different proof for the contraction property of the value iteration. 2) Furthermore, from the additional insights, new geometric behaviors of value iteration are proven when the initial iterate lies in a special region. We anticipate that the proposed perspectives might have the potential to be a useful tool, applicable in various settings. Therefore, further development of these methods could be a valuable avenue for future research.

Optimization and Control,Systems and Control
On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes

Bruno Scherrer,Boris Lesner

DOI: https://doi.org/10.48550/arXiv.1211.6898

2012-11-29

Abstract:We consider infinite-horizon stationary $\gamma$-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. Using Value and Policy Iteration with some error $\epsilon$ at each iteration, it is well-known that one can compute stationary policies that are $\frac{2\gamma}{(1-\gamma)^2}\epsilon$-optimal. After arguing that this guarantee is tight, we develop variations of Value and Policy Iteration for computing non-stationary policies that can be up to $\frac{2\gamma}{1-\gamma}\epsilon$-optimal, which constitutes a significant improvement in the usual situation when $\gamma$ is close to 1. Surprisingly, this shows that the problem of "computing near-optimal non-stationary policies" is much simpler than that of "computing near-optimal stationary policies".

Machine Learning,Artificial Intelligence
On the Convergence Rate of MCTS for the Optimal Value Estimation in Markov Decision Processes

Hyeong Soo Chang

2024-02-11

Abstract:A recent theoretical analysis of a corrected-version of a Monte-Carlo tree search (MCTS) method, so-called UCT, established an unexpected result, due to a great deal of empirical successes reported from heuristic usage of UCT with relevant adjustments for the problem domains in the literature, that its convergence rate in estimating the expected error relative to the optimal value of a finite-horizon Markov decision process (MDP) at an initial state is polynomial. We strengthen this dispiriting slow-convergence result by arguing within a simpler framework in the perspective of MDP, apart from the usual MCTS description, that just simpler UCB1 applied with the policy set as the arm set is actually competitive with or asymptotically faster than the corrected-version of UCT because of its logarithmic convergence-rate. We also point out that MCTS in general has the worst-case time and space complexities that depend on the state-set size which contradicts the original spirit of MCTS. Unless heuristically used, UCT-based MCTS has yet to have theoretical supports for its applicabilities.

Optimization and Control
On the Convergence of Reinforcement Learning with Monte Carlo Exploring Starts

Jun Liu

DOI: https://doi.org/10.48550/arXiv.2007.10916

2020-07-22

Abstract:A basic simulation-based reinforcement learning algorithm is the Monte Carlo Exploring Starts (MCES) method, also known as optimistic policy iteration, in which the value function is approximated by simulated returns and a greedy policy is selected at each iteration. The convergence of this algorithm in the general setting has been an open question. In this paper, we investigate the convergence of this algorithm for the case with undiscounted costs, also known as the stochastic shortest path problem. The results complement existing partial results on this topic and thereby help further settle the open problem. As a side result, we also provide a proof of a version of the supermartingale convergence theorem commonly used in stochastic approximation.

Optimization and Control,Machine Learning
Acceleration Operators in the Value Iteration Algorithms for Average Reward Markov Decision Processes

Oleksandr Shlakhter,Chi-Guhn Lee

DOI: https://doi.org/10.48550/arXiv.0806.0320

2008-06-03

Abstract:One of the most widely used methods for solving average cost MDP problems is the value iteration method. This method, however, is often computationally impractical and restricted in the size of solvable MDP problems. We propose acceleration operators that improve the performance of value iteration for average reward MDP models. These operators are based on two important properties of the Markovian operator: contraction mapping and monotonicity. It is well known that the classical relative value iteration methods for average cost criteria MDPs do not involve the max-norm contraction or monotonicity property. To overcome this difficulty we propose to combine acceleration operators with variants of value iteration for stochastic shortest path problems associated with average reward problems.
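For orientation, the following is a minimal sketch of classical relative value iteration, the baseline this abstract refers to. It does not implement the proposed acceleration operators. The dense random MDP, the choice of state 0 as the reference state, and the function names are assumptions made for the example.

```python
import numpy as np

def average_reward_backup(h, P, R):
    """Undiscounted Bellman backup: max over actions of R[s, a] + sum_t P[a, s, t] * h[t]."""
    return (R + np.einsum("ast,t->sa", P, h)).max(axis=1)

def relative_value_iteration(P, R, n_iters=500, ref_state=0):
    """Classical relative value iteration.

    After each backup the value at ref_state is subtracted, so the iterates stay
    bounded; the subtracted amount is the running estimate of the optimal gain.
    """
    h = np.zeros(R.shape[0])
    gain = 0.0
    for _ in range(n_iters):
        Th = average_reward_backup(h, P, R)
        gain = Th[ref_state]
        h = Th - gain
    return gain, h

# Hypothetical MDP with strictly positive transitions (hence unichain and aperiodic).
rng = np.random.default_rng(0)
n_states, n_actions = 6, 3
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))

gain, h = relative_value_iteration(P, R)

# At convergence the optimality equation T h = h + gain * 1 should hold.
residual = average_reward_backup(h, P, R) - h
```

A quick check of the result is that `residual` is (numerically) the constant vector `gain`, which is the average-reward optimality equation for the bias estimate `h`.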

Optimization and Control
Convergence of Expectation-Maximization Algorithm With Mixed-Integer Optimization

Geethu Joseph

DOI: https://doi.org/10.1109/lsp.2024.3393352

2024-05-03

IEEE Signal Processing Letters

Abstract:The convergence of expectation-maximization (EM)-based algorithms typically requires continuity of the likelihood function with respect to all the unknown parameters (optimization variables). The requirement is not met when parameters comprise both discrete and continuous variables, making the convergence analysis nontrivial. This paper introduces a set of conditions that ensure the convergence of a specific class of EM algorithms that estimate a mixture of discrete and continuous parameters. Our results offer a new analysis technique for iterative algorithms that solve mixed-integer non-linear optimization problems. As a concrete example, we prove the convergence of an existing EM-based sparse Bayesian learning algorithm that estimates the state of a linear dynamical system with jointly sparse inputs and bursty missing observations. Our results establish that the algorithm converges to the set of stationary points of the maximum likelihood cost with respect to the continuous optimization variables.

engineering, electrical & electronic
A Multi-Criteria Value Iteration Algorithm For POMDP Problems

Feng Liu,Tao Zheng,Xia Hua

DOI: https://doi.org/10.1109/SSCI.2016.7849372

2016-01-01

Abstract:Point-based value iteration algorithms have been deeply studied for solving POMDP problems. However, most of these algorithms explore the belief point set using only a single heuristic criterion, which limits their effectiveness. This paper presents a novel value iteration algorithm (MCVI) that uses multiple criteria to explore the belief point set. MCVI filters out the belief points on which the interval between the upper and lower bounds of the value function is less than a threshold, and then explores the successor belief point which is farthest away from the explored belief point set. MCVI can improve the effect and efficiency of convergence by guaranteeing that the explored point set is effective and fully distributed in the reachable belief space. Experimental results on four benchmarks show that MCVI can obtain a better global optimal solution.
Value Iteration is Optic Composition

Jules Hedges,Riu Rodríguez Sakamoto

DOI: https://doi.org/10.4204/EPTCS.380.24

2023-07-31

Abstract:Dynamic programming is a class of algorithms used to compute optimal control policies for Markov decision processes. Dynamic programming is ubiquitous in control theory, and is also the foundation of reinforcement learning. In this paper, we show that value improvement, one of the main steps of dynamic programming, can be naturally seen as composition in a category of optics, and intuitively, the optimal value function is the limit of a chain of optic compositions. We illustrate this with three classic examples: the gridworld, the inverted pendulum and the savings problem. This is a first step towards a complete account of reinforcement learning in terms of parametrised optics.

Category Theory,Optimization and Control
A Probability-Based Value Iteration on Optimal Policy Algorithm for POMDP

Feng LIU,Chong-jun WANG,Bin LUO

DOI: https://doi.org/10.3969/j.issn.0372-2112.2016.05.010

2016-01-01

Abstract:With the growing scale of POMDP problems in applications, research on heuristic methods that explore the reachable area based on the optimal policy has become a current hotspot. However, the criteria that existing algorithms use to choose the best action are imperfect, which reduces their efficiency. This paper proposes a new value iteration method, PBVIOP (Probability-based Value Iteration on Optimal Policy). In depth-first heuristic exploration, this method uses the Monte Carlo algorithm to calculate the probability of each action being optimal according to the distribution of each action's Q function value between its upper and lower bounds, and chooses the action with the maximum probability. Experimental results on four benchmarks show that the PBVIOP algorithm can obtain the global optimal solution and significantly improve convergence efficiency.
