Abstract: In this paper, we propose a single-sample-path-based algorithm with state aggregation to optimize the average reward of singularly perturbed Markov reward processes (SPMRPs) with large-scale state spaces. Such a reward process is assumed to depend on a set of parameters. Unlike other kinds of Markov chains, SPMRPs have a hierarchical structure of their own, and by exploiting this structure our algorithm reduces the computational load of performance optimization. Moreover, because the algorithm evolves along the simulated sample path, it can be applied online. In contrast to the original algorithm for general Markov reward processes (MRPs), we introduce a new gradient formula for the average-reward performance metric in SPMRPs, which is proved in the Appendix. Based on these gradients, we present the schedule of the iterative algorithm, which uses a single sample path. Finally, we analyze a special case in which the parameters affect only the disturbance matrices, and give a precise comparison between our algorithm and the earlier ones designed for general MRPs. When applied to SPMRPs, our method runs faster in these cases. Furthermore, to illustrate the practical value of SPMRPs, a simple multiprogramming example from computer systems is presented and simulated. The physical meaning of SPMRPs in networks of queues is also clarified with reference to practical models.
Keywords: Singularly perturbed Markov processes, Gradient of average reward, Differential reward, State aggregation, Perturbed closed network.
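To make the setting concrete, the following is a minimal sketch of a singularly perturbed chain and a single-sample-path estimate of its average reward. It is not the algorithm proposed in the paper. The 4-state matrices `P0` and `Q`, the reward vector, the two-block partition, and all function names are hypothetical and chosen only for illustration. The sketch assumes the common form P(eps) = P0 + eps * Q, with a block-diagonal `P0` and a disturbance matrix `Q` whose rows sum to zero.

```python
import random

# Hypothetical example: two weakly coupled blocks, {0, 1} and {2, 3}.
# P0 is block diagonal, so without perturbation the blocks never communicate.
P0 = [
    [0.5, 0.5, 0.0, 0.0],
    [0.3, 0.7, 0.0, 0.0],
    [0.0, 0.0, 0.6, 0.4],
    [0.0, 0.0, 0.2, 0.8],
]
# Assumed disturbance matrix: each row sums to zero, so P0 + eps*Q stays stochastic.
Q = [
    [-1.0, 0.0, 1.0, 0.0],
    [0.0, -1.0, 0.0, 1.0],
    [1.0, 0.0, -1.0, 0.0],
    [0.0, 1.0, 0.0, -1.0],
]
REWARD = [1.0, 2.0, 5.0, 3.0]   # illustrative per-state rewards
BLOCK_OF = [0, 0, 1, 1]         # aggregation map: state -> block index


def perturbed_matrix(eps):
    """Return P(eps) = P0 + eps * Q (valid here for 0 <= eps <= 0.5)."""
    n = len(P0)
    return [[P0[i][j] + eps * Q[i][j] for j in range(n)] for i in range(n)]


def stationary_distribution(P, iterations=5000):
    """Approximate the stationary distribution by repeated multiplication pi <- pi P."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def simulate_average_reward(P, steps, seed=0):
    """Estimate the average reward and block occupancies from one sample path."""
    rng = random.Random(seed)
    state = 0
    total = 0.0
    block_visits = [0, 0]
    for _ in range(steps):
        total += REWARD[state]
        block_visits[BLOCK_OF[state]] += 1
        # Draw the next state from row `state` of the transition matrix.
        state = rng.choices(range(len(P)), weights=P[state])[0]
    return total / steps, [v / steps for v in block_visits]


if __name__ == "__main__":
    P = perturbed_matrix(0.05)
    pi = stationary_distribution(P)
    exact = sum(p * r for p, r in zip(pi, REWARD))
    estimate, blocks = simulate_average_reward(P, 200000, seed=1)
    print("exact average reward:", exact)
    print("single-path estimate:", estimate)
    print("block occupancies:", blocks)
```

With a small `eps`, the chain mixes quickly inside each block and moves between blocks only rarely. This two-time-scale behaviour is the hierarchical structure that state aggregation exploits: the block occupancies summarize the slow dynamics, while the within-block behaviour is governed mostly by `P0`.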
