Abstract: Models of many real-life applications, such as queuing models of communication networks or computing systems, have a countably infinite state-space. Algorithmic and learning procedures that have been developed to produce optimal policies mainly focus on finite-state settings and do not directly apply to these models. To address this gap, we study the problem of optimal control of a family of discrete-time Markov Decision Processes (MDPs) governed by an unknown parameter $\theta\in\Theta$ and defined on a countably infinite state-space $\mathcal X=\mathbb{Z}_+^d$, with a finite action space $\mathcal A$ and an unbounded cost function. We take a Bayesian perspective, with the random unknown parameter $\boldsymbol{\theta}^*$ generated via a given fixed prior distribution on $\Theta$. To optimally control the unknown MDP, we propose an algorithm based on Thompson sampling with dynamically-sized episodes: at the beginning of each episode, the posterior distribution formed via Bayes' rule is used to produce a parameter estimate, which then determines the policy applied during the episode. To ensure the stability of the Markov chain obtained by following the policy chosen for each parameter, we impose ergodicity assumptions. Under these assumptions, and using the solution of the average cost Bellman equation, we establish an $\tilde O(dh^d\sqrt{|\mathcal A|T})$ upper bound on the Bayesian regret of our algorithm, where $T$ is the time-horizon. Finally, to elucidate the applicability of our algorithm, we consider two different queuing models with unknown dynamics and show that our algorithm can be applied to develop approximately optimal control algorithms.
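
The episodic structure described in the abstract (sample a parameter from the posterior at the start of an episode, follow the policy for that sample, update the posterior via Bayes' rule) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the single-queue model with known arrival probability `LAM`, the finite parameter set, the threshold rule in `policy_for` (a placeholder for solving the average cost Bellman equation), and the episode-ending rule (stop the first time the queue is empty after at least `k` steps in episode `k`).

```python
import random

LAM = 0.3  # assumed known arrival probability (hypothetical)

def transition_probs(x, a, theta):
    """Toy queue: action 1 serves with prob. theta, action 0 with theta / 2."""
    s = (theta if a == 1 else theta / 2) if x > 0 else 0.0
    up = LAM * (1 - s)          # arrival, no departure
    down = (1 - LAM) * s        # departure, no arrival
    probs = {x + 1: up, x: 1 - up - down}
    if x > 0:
        probs[x - 1] = down
    return probs

def policy_for(theta):
    """Placeholder threshold policy; stands in for the Bellman-optimal policy."""
    threshold = round(2 / theta)
    return lambda x: 1 if x >= threshold else 0

def run_thompson_sampling(thetas, prior, true_theta, horizon, seed=0):
    rng = random.Random(seed)
    posterior = list(prior)
    x, t = 0, 0
    episode_lengths = []
    while t < horizon:
        # Start of episode: sample a parameter from the current posterior.
        k = rng.choices(range(len(thetas)), weights=posterior)[0]
        policy = policy_for(thetas[k])
        min_len = len(episode_lengths) + 1  # hypothetical episode-length rule
        length = 0
        while t < horizon:
            a = policy(x)
            # The environment moves according to the true (unknown) parameter.
            probs = transition_probs(x, a, true_theta)
            states = list(probs)
            y = rng.choices(states, weights=[probs[s] for s in states])[0]
            # Bayes' rule: reweight each candidate by the observed transition.
            weights = [p * transition_probs(x, a, th).get(y, 0.0)
                       for p, th in zip(posterior, thetas)]
            total = sum(weights)
            posterior = [w / total for w in weights]
            x, t, length = y, t + 1, length + 1
            if length >= min_len and x == 0:
                break  # end the episode at an empty queue
        episode_lengths.append(length)
    return posterior, episode_lengths, x

posterior, lengths, final_state = run_thompson_sampling(
    thetas=[0.3, 0.6, 0.9], prior=[1 / 3] * 3, true_theta=0.6, horizon=2000)
print(posterior, len(lengths), final_state)
```

The posterior is updated at every step but only consulted when a new episode begins, which keeps the policy fixed within an episode as the abstract describes.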

A Rollout Algorithm For Multichain Markov Decision Processes With Average Cost

Constrained Multiagent Rollout and Multidimensional Assignment with the Auction Algorithm

Decentralised Q-Learning for Multi-Agent Markov Decision Processes with a Satisfiability Criterion

A Simulation Optimization Algorithm for CTMDPs Based on Randomized Stationary Policies

Bayesian Learning of Optimal Policies in Markov Decision Processes with Countably Infinite State-Space

A Provably Efficient Algorithm for Linear Markov Decision Process with Low Switching Cost

Relative Q-Learning for Average-Reward Markov Decision Processes with Continuous States

Multi-token Markov Game with Switching Costs

Optimization of Joint Replacement Policies for Multipart Systems by a Rollout Framework

Optimal Sample Complexity for Average Reward Markov Decision Processes

Sharper Model-free Reinforcement Learning for Average-reward Markov Decision Processes

Rollout Heuristics for Online Stochastic Contingent Planning

Simulation-Based optimization of singularly perturbed markov reward processes with states aggregation

A Structure-aware Online Learning Algorithm for Markov Decision Processes

Simulation Optimization Algorithm for SMDPs with Parameterized Randomized Stationary Policies

Characterization of the optimal average cost in Markov decision chains driven by a risk-seeking controller

Lifted-Rollout for Approximate Policy Iteration of Markov Decision Process

Rollout Strategies for Real-Time Multi-Energy Scheduling in Microgrid with Storage System

A rollout method for finite-stage event-based decision processes

Combinations and Mixtures of Optimal Policies in Unichain Markov Decision Processes are Optimal

A safe exploration approach to constrained Markov decision processes