Abstract: Models of many real-life applications, such as queuing models of communication networks or computing systems, have a countably infinite state space. Algorithmic and learning procedures developed to produce optimal policies mainly focus on finite-state settings and do not apply directly to these models. To fill this gap, in this work we study the problem of optimal control of a family of discrete-time countable state-space Markov Decision Processes (MDPs) governed by an unknown parameter $\theta\in\Theta$ and defined on a countably infinite state space $\mathcal X=\mathbb{Z}_+^d$, with a finite action space $\mathcal A$ and an unbounded cost function. We take a Bayesian perspective in which the random unknown parameter $\boldsymbol{\theta}^*$ is drawn from a given, fixed prior distribution on $\Theta$. To optimally control the unknown MDP, we propose an algorithm based on Thompson sampling with dynamically-sized episodes: at the beginning of each episode, the posterior distribution formed via Bayes' rule is used to sample a parameter estimate, which then determines the policy applied during the episode. To ensure the stability of the Markov chain obtained by following the policy chosen for each parameter, we impose ergodicity assumptions. Using these assumptions together with the solution of the average-cost Bellman equation, we establish an $\tilde O(dh^d\sqrt{|\mathcal A|T})$ upper bound on the Bayesian regret of our algorithm, where $T$ is the time horizon. Finally, to illustrate the applicability of our algorithm, we consider two different queuing models with unknown dynamics and show that our algorithm can be applied to develop approximately optimal control algorithms.
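The episode structure described in the abstract (sample a parameter from the posterior, follow that parameter's policy, update the posterior by Bayes' rule, end the episode by a data-dependent rule) can be sketched on a toy single-server queue. Everything about the model here is a hypothetical illustration, not the paper's setup: arrivals are Bernoulli($\theta$) with $\theta$ from a finite set $\Theta$, the two actions are service speeds `MU`, the per-slot cost is queue length plus `SERVICE_COST[a]`, and arrivals are assumed to be observed directly. The planner `plan` uses relative value iteration on a truncated state space $\{0,\dots,N\}$ as a stand-in for solving the average-cost Bellman equation on the full countable space, and the stopping rules follow TSDE (Ouyang et al., 2017), which may differ from the rule used in this work.

```python
import numpy as np

# Hypothetical toy model: one queue, Bernoulli(theta) arrivals, two service speeds.
MU = np.array([0.3, 0.8])            # service probability of action 0 (slow) and 1 (fast)
SERVICE_COST = np.array([0.0, 2.0])  # extra per-slot cost of fast service
N = 60                               # truncation level used only for planning (assumption)


def expected_next(h, theta, mu):
    """E[h(x')] for every x in 0..N, where x' = min(max(x - S, 0) + A, N)."""
    x = np.arange(N + 1)
    total = np.zeros(N + 1)
    for s, ps in ((1, mu), (0, 1 - mu)):              # service completion S ~ Bernoulli(mu)
        for arr, pa in ((1, theta), (0, 1 - theta)):  # arrival A ~ Bernoulli(theta)
            nxt = np.minimum(np.maximum(x - s, 0) + arr, N)
            total += ps * pa * h[nxt]
    return total


def plan(theta, iters=2000):
    """Relative value iteration for the average-cost Bellman equation on 0..N.
    Returns the greedy action for each truncated state."""
    states = np.arange(N + 1)
    h = np.zeros(N + 1)
    for _ in range(iters):
        q = np.stack([states + SERVICE_COST[a] + expected_next(h, theta, MU[a])
                      for a in range(2)], axis=1)
        h_new = q.min(axis=1)
        h = h_new - h_new[0]  # normalize so that h(0) = 0
    return q.argmin(axis=1)


def tsde(theta_true, thetas, prior, horizon, seed=0):
    """Thompson sampling with dynamically-sized episodes over a finite parameter set."""
    rng = np.random.default_rng(seed)
    thetas = np.asarray(thetas, dtype=float)
    post = np.asarray(prior, dtype=float) / np.sum(prior)
    policies = [plan(th) for th in thetas]  # Theta is finite, so plan once per parameter
    visits = {}                             # N_t(x, a) on the unbounded state space
    x, t, total_cost, prev_len = 0, 0, 0.0, 0
    while t < horizon:
        pol = policies[rng.choice(len(thetas), p=post)]  # sample theta_k from the posterior
        start, start_visits = t, dict(visits)
        while t < horizon:
            x_prev, a = x, int(pol[min(x, N)])  # beyond N, reuse the action planned for N
            total_cost += x + SERVICE_COST[a]
            visits[(x, a)] = visits.get((x, a), 0) + 1
            arr = int(rng.random() < theta_true)
            srv = int(rng.random() < MU[a])
            x = max(x - srv, 0) + arr           # the true queue is not truncated
            post *= np.where(arr == 1, thetas, 1 - thetas)  # Bayes' rule on the observed arrival
            post /= post.sum()
            t += 1
            # TSDE stopping rules: episode longer than the previous one, or a visit count doubled
            if t - start > prev_len or visits[(x_prev, a)] > 2 * start_visits.get((x_prev, a), 0):
                break
        prev_len = t - start
    return total_cost / horizon, post
```

For example, `avg_cost, post = tsde(0.5, [0.2, 0.5, 0.7], [1, 1, 1], horizon=5000)` runs the sketch with a uniform prior; the posterior should concentrate on $\theta = 0.5$. Because $\Theta$ is finite in this toy, each policy is computed once up front; with a continuous $\Theta$ one would instead plan for the sampled parameter at the start of each episode.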

Markov decision Processes with fractional costs

Average cost Markov decision processes with countable state spaces

Markov Decision Processes with State-Dependent Discount Factors and Unbounded Rewards/costs

Optimal Policies for Quantum Markov Decision Processes

Performance Optimization of Semi-Markov Decision Processes with Discounted-cost Criteria

Markov Decision Processes with Time-Varying Geometric Discounting

State Aggregation In Markov Decision Processes

Mixed Markov Decision Processes in a Semi-Markov Environment with Discounted Criterion

Bayesian Learning of Optimal Policies in Markov Decision Processes with Countably Infinite State-Space

Markov Decision Problems with Unbounded Transition Rates under Discounted-Cost Performance Criteria

Risk-sensitive Average Continuous-Time Markov Decision Processes with Unbounded Transition and Cost Rates

Beyond discounted returns: Robust Markov decision processes with average and Blackwell optimality

Nonstationary Denumerable State Markov Decision Processes – with Average Variance Criterion

Reinforcement Learning of Risk-Constrained Policies in Markov Decision Processes

Mean Field Markov Decision Processes

Constrained Risk-Averse Markov Decision Processes

Optimization of Markov Decision Processes under the Variance Criterion

Risk-Sensitive Average Markov Decision Processes in General Spaces

Convergence of Markov Decision Processes with Constraints and State-Action Dependent Discount Factors

Online Markov decision processes with policy iteration

Measurized Markov Decision Processes Part I: The Discounted Infinite Horizon Criterion