Abstract: We consider the problem of learning in a non-stationary reinforcement learning (RL) environment, where the setting can be fully described by a piecewise stationary discrete-time Markov decision process (MDP). We introduce a variant of the Restarted Bayesian Online Change-Point Detection algorithm (R-BOCPD) that operates on input streams originating from the more general multinomial distribution and provides near-optimal theoretical guarantees in terms of false-alarm rate and detection delay. Based on this, we propose an improved version of the UCRL2 algorithm for MDPs whose state transition kernel is sampled from a multinomial distribution, which we call R-BOCPD-UCRL2. We perform a finite-time performance analysis and show that R-BOCPD-UCRL2 enjoys a favorable regret bound of $O\left(D O \sqrt{A T K_T \log\left(\frac{T}{\delta}\right) + \frac{K_T \log \frac{K_T}{\delta}}{\min\limits_\ell \, \mathbf{KL}\left(\mathbf{\theta}^{(\ell+1)} \,\|\, \mathbf{\theta}^{(\ell)}\right)}}\right)$, where $D$ is the largest MDP diameter from the set of MDPs defining the piecewise stationary MDP setting, $O$ is the finite number of states (constant over all changes), $A$ is the finite number of actions (constant over all changes), $K_T$ is the number of change points up to horizon $T$, and $\mathbf{\theta}^{(\ell)}$ is the transition kernel during the interval $[c_\ell, c_{\ell+1})$, which we assume to be multinomially distributed over the set of states $\mathbb{O}$. Interestingly, the performance bound does not directly scale with the variation in MDP state transition distributions and rewards, i.e., it can also accommodate abrupt changes. In practice, R-BOCPD-UCRL2 outperforms state-of-the-art methods in a variety of scenarios in synthetic environments. We provide a detailed experimental setup along with a code repository (to be released upon publication) that can be used to easily reproduce our experiments.
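To make the shape of the regret bound in the abstract concrete, the following minimal Python sketch evaluates the expression inside the $O(\cdot)$ with all constants dropped. It is an illustration under simplifying assumptions, not the paper's implementation: the function names `kl_categorical` and `regret_bound_order` are hypothetical, each stationary segment is summarized by a single next-state probability vector rather than a full transition kernel, and every probability is assumed strictly positive so that the KL divergence is finite.

```python
import math

def kl_categorical(p, q):
    """KL(p || q) for two probability vectors over the same finite state set.

    Assumes q[i] > 0 wherever p[i] > 0 (illustrative assumption).
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def regret_bound_order(D, O, A, T, K_T, delta, kernels):
    """Evaluate D * O * sqrt(A T K_T log(T/delta) + K_T log(K_T/delta) / min_l KL).

    kernels: list of K_T + 1 probability vectors, one per stationary segment
    (a simplification of the per-state-action transition kernel).
    Constants hidden by the O(.) notation are ignored.
    """
    # Smallest divergence between consecutive segments: the hardest change to detect.
    min_kl = min(
        kl_categorical(kernels[l + 1], kernels[l])
        for l in range(len(kernels) - 1)
    )
    inside = A * T * K_T * math.log(T / delta) + K_T * math.log(K_T / delta) / min_kl
    return D * O * math.sqrt(inside)

if __name__ == "__main__":
    # Hypothetical toy values: 2 states, 1 change point.
    subtle = [[0.5, 0.5], [0.55, 0.45]]
    abrupt = [[0.5, 0.5], [0.99, 0.01]]
    for name, kernels in (("subtle", subtle), ("abrupt", abrupt)):
        value = regret_bound_order(D=2, O=2, A=2, T=10_000, K_T=1, delta=0.05, kernels=kernels)
        print(name, round(value, 2))
```

In this sketch, the change magnitude enters only through the denominator $\min_\ell \mathbf{KL}(\cdot \,\|\, \cdot)$, so a larger (more abrupt) change shrinks the second term rather than inflating the bound, which is consistent with the abstract's remark that the bound does not scale directly with the variation between consecutive MDPs.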

Online Reinforcement Learning for Periodic MDP

Online Reinforcement Learning in Periodic MDP

Periodic Guidance Learning

Periodic agent-state based Q-learning for POMDPs

Online Reinforcement Learning in Markov Decision Process Using Linear Programming

Fundamental Limits of Reinforcement Learning in Environment with Endogeneous and Exogeneous Uncertainty

Reinforcement Learning with Delayed, Composite, and Partially Anonymous Reward

Harnessing Causality in Reinforcement Learning With Bagged Decision Times

Restarted Bayesian Online Change-point Detection for Non-Stationary Markov Decision Processes

Nonstationary Reinforcement Learning with Linear Function Approximation

Efficient Online Learning with Offline Datasets for Infinite Horizon MDPs: A Bayesian Approach

Continuous-Time Markov Decision Process With Average Reward: Using Reinforcement Learning Method

Monitored Markov Decision Processes

Model-free Reinforcement Learning in Infinite-horizon Average-reward Markov Decision Processes

Provably Efficient UCB-type Algorithms For Learning Predictive State Representations

Reinforcement Learning for Omega-Regular Specifications on Continuous-Time MDP

Sharper Model-free Reinforcement Learning for Average-reward Markov Decision Processes

A Bayesian Approach to Learning Bandit Structure in Markov Decision Processes

Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning

Statistical Guarantees for Lifelong Reinforcement Learning using PAC-Bayesian Theory

Markov Decision Processes with Continuous Side Information