Abstract: Models of many real-life applications, such as queuing models of communication networks or computing systems, have a countably infinite state-space. Algorithmic and learning procedures that have been developed to produce optimal policies mainly focus on finite state settings, and do not directly apply to these models. To overcome this lacuna, in this work we study the problem of optimal control of a family of discrete-time countable state-space Markov Decision Processes (MDPs) governed by an unknown parameter $\theta\in\Theta$, and defined on a countably-infinite state space $\mathcal X=\mathbb{Z}_+^d$, with finite action space $\mathcal A$, and an unbounded cost function. We take a Bayesian perspective with the random unknown parameter $\boldsymbol{\theta}^*$ generated via a given fixed prior distribution on $\Theta$. To optimally control the unknown MDP, we propose an algorithm based on Thompson sampling with dynamically-sized episodes: at the beginning of each episode, the posterior distribution formed via Bayes' rule is used to produce a parameter estimate, which then decides the policy applied during the episode. To ensure the stability of the Markov chain obtained by following the policy chosen for each parameter, we impose ergodicity assumptions. From this condition and using the solution of the average cost Bellman equation, we establish an $\tilde O(dh^d\sqrt{|\mathcal A|T})$ upper bound on the Bayesian regret of our algorithm, where $T$ is the time-horizon. Finally, to elucidate the applicability of our algorithm, we consider two different queuing models with unknown dynamics, and show that our algorithm can be applied to develop approximately optimal control algorithms.
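
For readers unfamiliar with the term, the Bayesian regret bounded in the abstract is usually defined along the following lines in average-cost settings. This is a sketch in assumed notation, not necessarily the paper's exact definition: $c$ is the per-step cost, $(X_t, A_t)$ the state-action pair at time $t$, and $J(\boldsymbol{\theta}^*)$ the optimal long-run average cost of the true model; none of these symbols appear in the abstract.

$$
R(T) \;=\; \mathbb{E}\left[\sum_{t=1}^{T} c(X_t, A_t) \;-\; T\,J(\boldsymbol{\theta}^*)\right],
$$

where the expectation is taken over the prior on $\boldsymbol{\theta}^*$, the randomness of the algorithm, and the state transitions. A bound that grows like $\sqrt{T}$ means the average per-step excess cost vanishes as $T$ grows.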

What problem does this paper attempt to address?

The paper addresses the design of effective control policies for Markov Decision Processes (MDPs) with countably infinite state spaces when the model parameters are unknown. The core issues addressed in the paper include:

1. **Existing Challenges**: Many real-world applications, such as communication networks and supply chain management, can be described using queueing models with countably infinite state spaces. Even when these models are assumed to be known, developing optimal control schemes remains challenging. Current reinforcement learning and other data-driven optimal control methods are mainly designed for finite state environments or specific types of models and are not directly applicable to these queueing models.
2. **Research Objective**: The goal of the paper is to develop a meta-learning scheme that achieves good performance using methods similar to reinforcement learning when the model parameters are unknown. Specifically, the paper studies a class of discrete-time MDPs governed by an unknown parameter θ, where every MDP in the class is defined on the same countably infinite state space.
3. **Algorithm Design**: To achieve this objective, the paper proposes an algorithm based on Thompson Sampling that uses episodes of dynamic length. At the beginning of each episode, the posterior distribution is updated using Bayes' rule, and a parameter estimate is drawn from this distribution, which determines the policy used in that episode.
4. **Theoretical Results**: The paper analyzes the Bayesian regret of the proposed algorithm and establishes an $\tilde O(dh^d\sqrt{|\mathcal A|T})$ upper bound, where $T$ is the time horizon. It gives upper bounds both when all policies are considered and when only a subset of the policy class is considered.
5. **Application Examples**: The paper provides two queueing models as application examples: a two-server queueing system with a common buffer, and a system of two heterogeneous parallel queues. For these models, the paper verifies that they meet the assumptions required by the proposed algorithm and demonstrates that the algorithm can be used to learn approximately optimal control policies.

In summary, the paper addresses the problem of designing effective control policies in MDPs with countably infinite state spaces and unknown parameters, and demonstrates the effectiveness and applicability of the algorithm through theoretical analysis and two queueing applications.
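
The episode structure described above can be illustrated with a toy example. The following is a minimal sketch under stated assumptions, not the paper's algorithm: the single-queue model, the finite parameter set `THETAS`, the constants `ARRIVAL_P` and `FAST_COST`, the heuristic `policy_for` (a stand-in for solving the average cost Bellman equation), and the episode stopping rule are all hypothetical choices made for illustration.

```python
import random

# Hypothetical finite parameter set Theta: each theta is the pair of per-slot
# service success probabilities (slow server, fast server). Illustrative only.
THETAS = [(0.5, 0.9), (0.7, 0.75)]
ARRIVAL_P = 0.4   # arrival probability per slot, assumed known
FAST_COST = 1.5   # assumed extra per-slot cost of using the fast server


def policy_for(theta):
    """Stand-in for the optimal policy of model theta (a myopic heuristic,
    not a solution of the average cost Bellman equation)."""
    slow, fast = theta
    return lambda x: 1 if (fast - slow) * x > FAST_COST else 0


def run_thompson_sampling(true_theta, horizon, seed=0):
    rng = random.Random(seed)
    posterior = [1.0 / len(THETAS)] * len(THETAS)  # uniform prior on Theta
    x, t, total_cost, episodes = 0, 0, 0.0, 0
    while t < horizon:
        # Start of episode: sample a parameter from the current posterior
        # and commit to its policy for the whole episode.
        episodes += 1
        sampled = rng.choices(THETAS, weights=posterior)[0]
        policy = policy_for(sampled)
        start = t
        while t < horizon:
            a = policy(x)
            total_cost += x + (FAST_COST if a == 1 else 0.0)
            arrival = rng.random() < ARRIVAL_P
            departure = x > 0 and rng.random() < true_theta[a]
            if x > 0:
                # Bayes' rule: reweight each candidate by the likelihood of
                # the observed service outcome under the chosen action.
                posterior = [
                    w * (th[a] if departure else 1.0 - th[a])
                    for w, th in zip(posterior, THETAS)
                ]
                norm = sum(posterior)
                posterior = [w / norm for w in posterior]
            x += int(arrival) - int(departure)
            t += 1
            # Assumed stopping rule: the episode ends once the queue is empty
            # and the episode has lasted at least `episodes` slots.
            if x == 0 and t - start >= episodes:
                break
    return posterior, total_cost, episodes
```

Calling `run_thompson_sampling(THETAS[0], horizon=5000)` returns the final posterior over `THETAS`, the accumulated cost, and the number of episodes. Tying the end of an episode to a return to the empty state is one simple way to make episode lengths data-dependent; the paper's actual rule may differ.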

Bayesian Learning of Optimal Policies in Markov Decision Processes with Countably Infinite State-Space
