Abstract: Models of many real-life applications, such as queuing models of communication networks or computing systems, have a countably infinite state space. Algorithmic and learning procedures developed to produce optimal policies mainly focus on finite-state settings and do not directly apply to these models. To address this gap, we study the optimal control of a family of discrete-time Markov Decision Processes (MDPs) governed by an unknown parameter $\theta\in\Theta$ and defined on the countably infinite state space $\mathcal X=\mathbb{Z}_+^d$, with a finite action space $\mathcal A$ and an unbounded cost function. We take a Bayesian perspective, with the random unknown parameter $\boldsymbol{\theta}^*$ drawn from a given fixed prior distribution on $\Theta$. To optimally control the unknown MDP, we propose an algorithm based on Thompson sampling with dynamically-sized episodes: at the beginning of each episode, the posterior distribution formed via Bayes' rule is used to produce a parameter estimate, which then determines the policy applied during the episode. To ensure the stability of the Markov chain obtained by following the policy chosen for each parameter, we impose ergodicity assumptions. Under these assumptions, and using the solution of the average cost Bellman equation, we establish an $\tilde O(dh^d\sqrt{|\mathcal A|T})$ upper bound on the Bayesian regret of our algorithm, where $T$ is the time horizon. Finally, to illustrate the applicability of our algorithm, we consider two different queuing models with unknown dynamics and show that our algorithm can be applied to develop approximately optimal control algorithms.
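The episodic loop described in the abstract (sample a parameter from the posterior at the start of each episode, follow the policy for that sample, update the posterior via Bayes' rule) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the toy single-server queue, the finite parameter set, the threshold rule in `policy_for` (a placeholder, not an optimal policy obtained from the Bellman equation), and the episode schedule that simply grows by one step per episode rather than the paper's dynamic sizing rule.

```python
import random

def service_prob(theta, action):
    # Hypothetical model: fast service (action 1) succeeds w.p. theta, slow w.p. theta / 2.
    return theta if action == 1 else theta / 2

def step_prob(x, action, delta, theta, lam):
    # Likelihood of the queue-length change delta in {-1, 0, 1} under parameter theta.
    mu = service_prob(theta, action) if x > 0 else 0.0
    if delta == 1:
        return lam * (1 - mu)
    if delta == -1:
        return (1 - lam) * mu
    return lam * mu + (1 - lam) * (1 - mu)

def policy_for(theta):
    # Placeholder threshold rule standing in for the policy of the sampled model.
    threshold = 1 if theta < 0.5 else 3
    return lambda x: 1 if x >= threshold else 0

def thompson_episodes(thetas, prior, true_theta, lam, horizon, seed=0):
    rng = random.Random(seed)
    posterior = list(prior)
    x, t, episode_len, total_cost = 0, 0, 1, 0.0
    while t < horizon:
        # Start of episode: sample a parameter from the current posterior.
        sampled = rng.choices(thetas, weights=posterior)[0]
        policy = policy_for(sampled)
        for _ in range(min(episode_len, horizon - t)):
            a = policy(x)
            mu = service_prob(true_theta, a) if x > 0 else 0.0
            # One arrival w.p. lam, one departure w.p. mu (no departure from an empty queue).
            delta = int(rng.random() < lam) - int(rng.random() < mu)
            total_cost += x + 0.5 * a  # unbounded holding cost plus a service cost
            # Bayes' rule on the finite parameter set.
            weights = [p * step_prob(x, a, delta, th, lam)
                       for p, th in zip(posterior, thetas)]
            z = sum(weights)
            posterior = [w / z for w in weights]
            x += delta
            t += 1
        episode_len += 1  # assumed schedule: each episode is one step longer
    return posterior, total_cost

posterior, cost = thompson_episodes(
    thetas=[0.4, 0.6, 0.8], prior=[1 / 3] * 3,
    true_theta=0.6, lam=0.2, horizon=2000)
print(posterior, cost)
```

The finite parameter set keeps the Bayes update exact and cheap; a continuous prior would need a conjugate family or an approximate sampler instead.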

Primal-Dual Regression Approach for Markov Decision Processes with General State and Action Spaces

Convergence Rate of Primal-Dual Approach to Constrained Reinforcement Learning with Softmax Policy

Accelerating Primal-Dual Methods for Regularized Markov Decision Processes

Deterministic Policy Gradient Primal-Dual Methods for Continuous-Space Constrained MDPs

Natural Policy Gradient Primal-Dual Method for Constrained Markov Decision Processes

Bayesian Learning of Optimal Policies in Markov Decision Processes with Countably Infinite State-Space

Double Duality: Variational Primal-Dual Policy Optimization for Constrained Reinforcement Learning

Cooperative Multi-Agent Constrained POMDPs: Strong Duality and Primal-Dual Reinforcement Learning with Approximate Information States

Structural Estimation of Markov Decision Processes in High-Dimensional State Space with Finite-Time Guarantees

Policy-based Primal-Dual Methods for Concave CMDP with Variance Reduction

Duality Between Large Deviation Control and Risk-Sensitive Control for Markov Decision Processes

Learning General Parameterized Policies for Infinite Horizon Average Reward Constrained MDPs via Primal-Dual Policy Gradient Algorithm

A primal-dual perspective for distributed TD-learning

Convergence and sample complexity of natural policy gradient primal-dual methods for constrained MDPs

Solving Robust MDPs through No-Regret Dynamics

Narrowing the Gap between Adversarial and Stochastic MDPs via Policy Optimization

Accelerated Primal-Dual Policy Optimization for Safe Reinforcement Learning

On Maximizing Probabilities for Over-Performing a Target for Markov Decision Processes

Truly No-Regret Learning in Constrained MDPs

Solving the Dual Problems of Dynamic Programs via Regression