Bayesian Learning of Optimal Policies in Markov Decision Processes with Countably Infinite State-Space

Saghar Adler, Vijay Subramanian
2024-03-17
Abstract: Models of many real-life applications, such as queuing models of communication networks or computing systems, have a countably infinite state-space. Algorithmic and learning procedures that have been developed to produce optimal policies mainly focus on finite state settings, and do not directly apply to these models. To overcome this lacuna, in this work we study the problem of optimal control of a family of discrete-time countable state-space Markov Decision Processes (MDPs) governed by an unknown parameter $\theta\in\Theta$, and defined on a countably-infinite state space $\mathcal X=\mathbb{Z}_+^d$, with finite action space $\mathcal A$, and an unbounded cost function. We take a Bayesian perspective with the random unknown parameter $\boldsymbol{\theta}^*$ generated via a given fixed prior distribution on $\Theta$. To optimally control the unknown MDP, we propose an algorithm based on Thompson sampling with dynamically-sized episodes: at the beginning of each episode, the posterior distribution formed via Bayes' rule is used to produce a parameter estimate, which then decides the policy applied during the episode. To ensure the stability of the Markov chain obtained by following the policy chosen for each parameter, we impose ergodicity assumptions. From this condition and using the solution of the average cost Bellman equation, we establish an $\tilde O(dh^d\sqrt{|\mathcal A|T})$ upper bound on the Bayesian regret of our algorithm, where $T$ is the time-horizon. Finally, to elucidate the applicability of our algorithm, we consider two different queuing models with unknown dynamics, and show that our algorithm can be applied to develop approximately optimal control algorithms.
Systems and Control, Machine Learning
What problem does this paper attempt to address?
The paper addresses the design of effective control policies for Markov Decision Processes (MDPs) with countably infinite state spaces when the model parameters are unknown. The core issues it addresses are:

1. **Existing Challenges**: Many real-world applications, such as communication networks and supply chain management, can be described by queueing models with countably infinite state spaces. Even when these models are assumed to be known, developing optimal control schemes remains challenging. Current reinforcement learning and other data-driven optimal control methods are designed mainly for finite-state environments or specific model classes, and do not directly apply to these queueing models.
2. **Research Objective**: The goal of the paper is to develop a meta-learning scheme that performs well, using methods similar to reinforcement learning, when the model parameters are unknown. Specifically, the paper studies a family of discrete-time MDPs governed by an unknown parameter θ, all defined on the same countably infinite state space.
3. **Algorithm Design**: To achieve this objective, the paper proposes an algorithm based on Thompson sampling with episodes of dynamic length. At the beginning of each episode, the posterior distribution is updated using Bayes' rule, and a parameter estimate is drawn from it; this estimate determines the policy used during that episode.
4. **Theoretical Results**: The paper analyzes the Bayesian regret of the proposed algorithm and gives upper bounds in different settings, for example when the algorithm considers all policies and when it considers only a subset of the policy class.
5. **Application Examples**: The paper provides two queueing models as application examples: (i) a two-server queueing system with a common buffer, and (ii) a system of two heterogeneous parallel queues. For both models, the paper verifies that they satisfy the assumptions required by the proposed algorithm and shows that the algorithm can be used to learn approximately optimal control policies.

In summary, the paper addresses the problem of designing effective control policies in MDPs with countably infinite state spaces and unknown parameters, and supports the applicability of its algorithm through theoretical analysis and the two queueing examples.
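The episode structure described under Algorithm Design can be made concrete with a small sketch. It is not the authors' algorithm. It assumes a toy single-server queue, a finite candidate set `THETAS` for the unknown arrival probability, a made-up threshold rule `policy_for` standing in for the solution of the average-cost Bellman equation, and a stopping rule borrowed from doubling-style episodic Thompson sampling (an episode ends once it outgrows the previous one or a state-action visit count more than doubles). None of these names or numbers come from the paper.

```python
import random

# All names and numbers below are illustrative assumptions, not from the paper.
THETAS = [0.2, 0.35, 0.5]    # hypothetical finite parameter set: arrival probability
SERVICE = {0: 0.4, 1: 0.7}   # hypothetical service-success probability per action


def move_probs(theta, x, a):
    """Return (P[x -> x+1], P[x -> x-1]) for a toy discrete-time queue."""
    mu = SERVICE[a] if x > 0 else 0.0   # an empty queue cannot serve
    return theta * (1 - mu), (1 - theta) * mu


def transition_prob(theta, x, a, x_next):
    """Likelihood of one observed transition under parameter theta."""
    up, down = move_probs(theta, x, a)
    return {1: up, -1: down, 0: 1 - up - down}.get(x_next - x, 0.0)


def step(theta, x, a, rng):
    """Sample the next queue length from the true dynamics."""
    up, down = move_probs(theta, x, a)
    u = rng.random()
    return x + 1 if u < up else x - 1 if u < up + down else x


def policy_for(theta):
    """Placeholder for solving the average-cost Bellman equation under theta."""
    threshold = 1 if theta >= 0.35 else 3   # made-up threshold rule
    return lambda x: 1 if x >= threshold else 0


def bayes_update(posterior, x, a, x_next):
    """Bayes' rule over the finite candidate set."""
    weights = [p * transition_prob(th, x, a, x_next)
               for p, th in zip(posterior, THETAS)]
    total = sum(weights)
    return [w / total for w in weights]


def run(true_theta, horizon, seed=0):
    rng = random.Random(seed)
    posterior = [1 / len(THETAS)] * len(THETAS)   # uniform prior
    counts, x, t, prev_len = {}, 0, 0, 0
    episode_starts = []
    while t < horizon:
        episode_starts.append(t)
        # Thompson step: draw one parameter from the current posterior
        # and commit to its policy for the whole episode.
        sampled = rng.choices(THETAS, weights=posterior)[0]
        policy = policy_for(sampled)
        counts_at_start, ep_len = dict(counts), 0
        while t < horizon:
            a = policy(x)
            x_next = step(true_theta, x, a, rng)
            posterior = bayes_update(posterior, x, a, x_next)
            counts[(x, a)] = counts.get((x, a), 0) + 1
            doubled = counts[(x, a)] > 2 * counts_at_start.get((x, a), 0)
            x, t, ep_len = x_next, t + 1, ep_len + 1
            # Assumed stopping rule: the episode outgrows the previous one,
            # or a state-action visit count has more than doubled.
            if ep_len > prev_len or doubled:
                break
        prev_len = ep_len
    return posterior, episode_starts


if __name__ == "__main__":
    posterior, starts = run(true_theta=0.35, horizon=2000, seed=1)
    print("posterior over THETAS:", posterior)
    print("number of episodes:", len(starts))
```

The sketch tracks the posterior at every step but resamples the parameter only at episode boundaries, which is what lets the episode lengths adapt to how much has been learned. With a continuous parameter space, the finite-set Bayes update would have to be replaced by a suitable posterior representation.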