Abstract: Models of many real-life applications, such as queuing models of communication networks or computing systems, have a countably infinite state space. Algorithmic and learning procedures that have been developed to produce optimal policies mainly focus on finite-state settings and do not directly apply to these models. To close this gap, we study the problem of optimal control of a family of discrete-time Markov Decision Processes (MDPs) governed by an unknown parameter $\theta\in\Theta$ and defined on a countably infinite state space $\mathcal X=\mathbb{Z}_+^d$, with a finite action space $\mathcal A$ and an unbounded cost function. We take a Bayesian perspective, with the random unknown parameter $\boldsymbol{\theta}^*$ generated via a given fixed prior distribution on $\Theta$. To optimally control the unknown MDP, we propose an algorithm based on Thompson sampling with dynamically sized episodes: at the beginning of each episode, the posterior distribution formed via Bayes' rule is used to produce a parameter estimate, which then determines the policy applied during the episode. To ensure the stability of the Markov chain obtained by following the policy chosen for each parameter, we impose ergodicity assumptions. From this condition, and using the solution of the average-cost Bellman equation, we establish an $\tilde O(dh^d\sqrt{|\mathcal A|T})$ upper bound on the Bayesian regret of our algorithm, where $T$ is the time horizon. Finally, to illustrate the applicability of our algorithm, we consider two different queuing models with unknown dynamics and show that our algorithm can be applied to develop approximately optimal control algorithms.
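The episodic structure described in the abstract (sample a parameter from the posterior at the start of each episode, follow the policy for that sample, update the posterior by Bayes' rule) can be illustrated with a minimal sketch. Everything concrete below is an assumption made for illustration and is not taken from the paper: the toy single-server queue, the finite parameter set, the threshold rule in `policy_for`, the holding-plus-service cost, and the rule that the k-th episode lasts k slots. The paper's actual episode-termination rule and its optimal policies are not reproduced here.

```python
import random


def step(x, serve, theta, rng):
    """One slot of a toy queue (assumed model): Bernoulli(theta) arrival,
    then a departure with probability 0.9 if we serve a non-empty queue."""
    arrival = 1 if rng.random() < theta else 0
    departure = 1 if (serve and x > 0 and rng.random() < 0.9) else 0
    return x + arrival - departure, arrival


def likelihood(arrival, theta):
    """Probability of the observed arrival indicator under parameter theta."""
    return theta if arrival else 1.0 - theta


def policy_for(theta):
    """Hypothetical stand-in for the optimal policy of the sampled parameter:
    serve once the queue length reaches a threshold that depends on theta."""
    threshold = 1 if theta > 0.3 else 2
    return lambda x: x >= threshold


def thompson_sampling(true_theta, thetas, prior, horizon, seed=0):
    """Episodic Thompson sampling over a finite parameter set (a sketch)."""
    rng = random.Random(seed)
    posterior = list(prior)
    x, t, k, total_cost = 0, 0, 0, 0.0
    while t < horizon:
        k += 1
        # Start of episode k: draw a parameter from the current posterior
        # and fix the corresponding policy for the whole episode.
        sampled = rng.choices(thetas, weights=posterior)[0]
        policy = policy_for(sampled)
        # Assumed episode length: k slots (placeholder for the paper's rule).
        for _ in range(min(k, horizon - t)):
            serve = policy(x)
            # Unbounded cost: queue length plus an assumed service charge.
            total_cost += x + (0.5 if serve else 0.0)
            x, arrival = step(x, serve, true_theta, rng)
            # Bayes' rule: reweight each candidate by the arrival likelihood.
            weights = [p * likelihood(arrival, th)
                       for p, th in zip(posterior, thetas)]
            z = sum(weights)
            posterior = [w / z for w in weights]
            t += 1
    return posterior, total_cost
```

Calling `thompson_sampling(0.4, [0.2, 0.4], [0.5, 0.5], 2000)` returns the final posterior over the two candidate arrival rates together with the accumulated cost. In this toy setting the posterior is updated in every slot but only consulted at episode boundaries, which mirrors the structure the abstract describes.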

Entropy Rate Maximization of Markov Decision Processes under Linear Temporal Logic Tasks

Entropy Rate Maximization of Markov Decision Processes for Surveillance Tasks

Markov decision processes with maximum entropy rate for Surveillance Tasks

Transfer Entropy in MDPs with Temporal Logic Specifications

Entropy Maximization for Partially Observable Markov Decision Processes

State Entropy Optimization in Markov Decision Processes

Synthesis of Discounted-Reward Optimal Policies for Markov Decision Processes Under Linear Temporal Logic Specifications

Unpredictable Planning Under Partial Observability

Control of Probabilistic Systems under Dynamic, Partially Known Environments with Temporal Logic Specifications

Optimal Control Synthesis of Markov Decision Processes for Efficiency with Surveillance Tasks

Optimal Control of Logically Constrained Partially Observable and Multi-Agent Markov Decision Processes

Sample Efficient Model-free Reinforcement Learning from LTL Specifications with Optimality Guarantees

Reinforcement Learning for Temporal Logic Control Synthesis with Probabilistic Satisfaction Guarantees

On the Complexity of Computing Maximum Entropy for Markovian Models

Probabilistic Planning with Prioritized Preferences over Temporal Logic Objectives

Optimal Time-Abstract Schedulers for CTMDPs and Markov Games

Reinforcement Learning Based Temporal Logic Control with Maximum Probabilistic Satisfaction

The Limits of Pure Exploration in POMDPs: When the Observation Entropy is Enough

Sample-Efficient Reinforcement Learning with Temporal Logic Objectives: Leveraging the Task Specification to Guide Exploration

Stochastic Finite State Control of POMDPs with LTL Specifications

Bayesian Learning of Optimal Policies in Markov Decision Processes with Countably Infinite State-Space