Abstract: We study reinforcement learning (RL) in episodic MDPs with adversarial full-information losses and unknown transitions. Instead of the classical static regret, we adopt dynamic regret as the performance measure, which benchmarks the learner against a sequence of changing policies and is therefore better suited to non-stationary environments. The primary challenge is to handle two uncertainties simultaneously: the unknown transition and the unknown non-stationarity of the environment. We propose a general framework that decouples these two sources of uncertainty and show that the dynamic regret bound naturally decomposes into two terms, one due to constructing confidence sets to handle the unknown transition and the other due to choosing sub-optimal policies under the unknown non-stationarity. To this end, we first employ a two-layer online ensemble structure, which is model-agnostic, to handle the adaptation error due to the unknown non-stationarity. We then instantiate the framework for three fundamental MDP models (tabular MDPs, linear MDPs, and linear mixture MDPs) and present corresponding approaches to control the exploration error due to the unknown transition. We provide a dynamic regret guarantee for each model and show, by establishing matching lower bounds, that these guarantees are optimal in terms of the number of episodes K and the non-stationarity P̄ᴋ. To the best of our knowledge, this is the first work to achieve dynamic regret with optimal dependence on K and P̄ᴋ, without prior knowledge of the non-stationarity, for adversarial MDPs with unknown transition.
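
For orientation, the quantities named in the abstract can be written out as follows. This is a minimal sketch in standard notation, not the paper's own statement: the symbols $\pi_k$ (learner's policy in episode $k$), $\pi_k^c$ (comparator policy), $V_k^{\pi}(s_1)$ (expected cumulative loss of $\pi$ in episode $k$ from the initial state $s_1$), and the two error labels are assumptions introduced here. The precise definition of the non-stationarity measure P̄ᴋ is not given in the abstract and is left unspecified.

```latex
% Dynamic regret against an arbitrary comparator sequence (assumed notation)
\[
  \mathrm{D\text{-}Regret}_K(\pi_{1:K}^c)
  \;=\; \sum_{k=1}^{K} V_k^{\pi_k}(s_1) \;-\; \sum_{k=1}^{K} V_k^{\pi_k^c}(s_1).
\]
% Static regret is the special case of a fixed comparator:
% \pi_k^c = \pi^* for all k.
%
% Schematic form of the decomposition described in the abstract:
\[
  \mathrm{D\text{-}Regret}_K(\pi_{1:K}^c)
  \;\le\;
  \underbrace{\mathrm{ExplorationError}_K}_{\text{unknown transition}}
  \;+\;
  \underbrace{\mathrm{AdaptationError}_K(\bar{P}_K)}_{\text{unknown non-stationarity}}.
\]
```

Only the second term depends on how much the comparator policies change, which is why the online ensemble that controls it can be designed independently of the MDP model.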

Cascaded Gaps: Towards Gap-Dependent Regret for Risk-Sensitive Reinforcement Learning

Beyond Value-Function Gaps: Improved Instance-Dependent Regret Bounds for Episodic Reinforcement Learning

Regret Bounds for Risk-sensitive Reinforcement Learning with Lipschitz Dynamic Risk Measures

Exponential Bellman Equation and Improved Regret Bounds for Risk-Sensitive Reinforcement Learning

Gap-Dependent Bounds for Q-Learning using Reference-Advantage Decomposition

Regret Bounds for Markov Decision Processes with Recursive Optimized Certainty Equivalents

Bridging Distributional and Risk-sensitive Reinforcement Learning with Provable Regret Bounds

Improved Regret Bound for Safe Reinforcement Learning via Tighter Cost Pessimism and Reward Optimism

Taming Equilibrium Bias in Risk-Sensitive Multi-Agent Reinforcement Learning

RM-FSP: Regret Minimization Optimizes Neural Fictitious Self-Play

Dynamic Regret of Online Markov Decision Processes

Fundamental Limits of Reinforcement Learning in Environment with Endogeneous and Exogeneous Uncertainty

Dynamic Regret of Adversarial MDPs with Unknown Transition and Linear Function Approximation

Regret Bounds for Episodic Risk-Sensitive Linear Quadratic Regulator

Square-root regret bounds for continuous-time episodic Markov decision processes

$\sqrt{n}$-Regret for Learning in Markov Decision Processes with Function Approximation and Low Bellman Rank

Tiered Reinforcement Learning: Pessimism in the Face of Uncertainty and Constant Regret

Settling Constant Regrets in Linear Markov Decision Processes

Efficient Risk-Averse Reinforcement Learning

Dynamic Regret of Policy Optimization in Non-stationary Environments

Logarithmic regret bounds for continuous-time average-reward Markov decision processes