Abstract:We study regret minimization for reinforcement learning (RL) in Latent Markov Decision Processes (LMDPs) with context in hindsight. We design a novel model-based algorithmic framework which can be instantiated with both a model-optimistic and a value-optimistic solver. We prove an $\tilde{O}(\sqrt{\mathsf{Var}^\star M \Gamma S A K})$ regret bound where $\tilde{O}$ hides logarithm factors, $M$ is the number of contexts, $S$ is the number of states, $A$ is the number of actions, $K$ is the number of episodes, $\Gamma \le S$ is the maximum transition degree of any state-action pair, and $\mathsf{Var}^\star$ is a variance quantity describing the determinism of the LMDP. The regret bound only scales logarithmically with the planning horizon, thus yielding the first (nearly) horizon-free regret bound for LMDP. This is also the first problem-dependent regret bound for LMDP. Key in our proof is an analysis of the total variance of alpha vectors (a generalization of value functions), which is handled with a truncation method. We complement our positive result with a novel $\Omega(\sqrt{\mathsf{Var}^\star M S A K})$ regret lower bound with $\Gamma = 2$, which shows our upper bound minimax optimal when $\Gamma$ is a constant for the class of variance-bounded LMDPs. Our lower bound relies on new constructions of hard instances and an argument inspired by the symmetrization technique from theoretical computer science, both of which are technically different from existing lower bound proof for MDPs, and thus can be of independent interest.

Optimistic Posterior Sampling for Reinforcement Learning: Worst-Case Regret Bounds

Efficient Exploration in Average-Reward Constrained Reinforcement Learning: Achieving Near-Optimal Regret With Posterior Sampling

Learning to Optimize via Posterior Sampling

Thompson Sampling for Infinite-Horizon Discounted Decision Processes

Optimistic Regret Bounds for Online Learning in Adversarial Markov Decision Processes

Efficient and Adaptive Posterior Sampling Algorithms for Bandits

Generalized Regret Analysis of Thompson Sampling using Fractional Posteriors

Optimistic Thompson Sampling for No-Regret Learning in Unknown Games

Regret Minimization For Reinforcement Learning By Evaluating The Optimal Bias Function

On the Prior Sensitivity of Thompson Sampling

Posterior Sampling-based Online Learning for Episodic POMDPs

Provably Efficient Reinforcement Learning for Infinite-Horizon Average-Reward Linear MDPs

Prior-dependent analysis of posterior sampling reinforcement learning with function approximation

Posterior Sampling-Based Bayesian Optimization with Tighter Bayesian Regret Bounds

Online Reinforcement Learning in Markov Decision Process Using Linear Programming

Improved Bayesian Regret Bounds for Thompson Sampling in Reinforcement Learning

Information-Theoretic Confidence Bounds for Reinforcement Learning

Horizon-Free and Variance-Dependent Reinforcement Learning for Latent Markov Decision Processes

A Bayesian Learning Algorithm for Unknown Zero-sum Stochastic Games with an Arbitrary Opponent

Information-Theoretic Minimax Regret Bounds for Reinforcement Learning based on Duality