Abstract: Achieving sample efficiency in online episodic reinforcement learning (RL) requires optimally balancing exploration and exploitation. For a finite-horizon episodic Markov decision process with $S$ states, $A$ actions and horizon length $H$, substantial progress has been made toward characterizing the minimax-optimal regret, which scales on the order of $\sqrt{H^2SAT}$ (modulo log factors), with $T$ the total number of samples. While several competing solution paradigms have been proposed to minimize regret, they are either memory-inefficient or fall short of optimality unless the sample size exceeds an enormous threshold (e.g. $S^6A^4 \,\mathrm{poly}(H)$ for existing model-free methods). To overcome such a large sample size barrier to efficient RL, we design a novel model-free algorithm, with space complexity $O(SAH)$, that achieves near-optimal regret as soon as the sample size exceeds the order of $SA\,\mathrm{poly}(H)$. In terms of this sample size requirement (also referred to as the initial burn-in cost), our method improves, by at least a factor of $S^5A^3$, upon any prior memory-efficient algorithm that is asymptotically regret-optimal. Leveraging the recently introduced variance reduction strategy (also called reference-advantage decomposition), the proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences with upper and lower confidence bounds. The design principle of our early-settled variance reduction method might be of independent interest to other RL settings that involve intricate exploration–exploitation trade-offs.
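To make the scalings in the abstract concrete, here is a minimal arithmetic sketch, not part of the paper. The function names `minimax_regret_order` and `burn_in_ratio` are hypothetical. The abstract does not specify the $\mathrm{poly}(H)$ factors, so the sketch assumes they can be set aside and compares only the dependence on $S$ and $A$; log factors and constants are ignored as well.

```python
import math


def minimax_regret_order(S: int, A: int, H: int, T: int) -> float:
    """Order of the minimax-optimal regret, sqrt(H^2 * S * A * T).

    Log factors and constants are dropped, so this is a scaling, not a bound.
    """
    return math.sqrt(H**2 * S * A * T)


def burn_in_ratio(S: int, A: int) -> int:
    """Ratio of the two burn-in thresholds in S and A only.

    Assumption: the poly(H) factors are ignored, since the abstract
    leaves them unspecified. (S^6 * A^4) / (S * A) = S^5 * A^3.
    """
    return (S**6 * A**4) // (S * A)


if __name__ == "__main__":
    S, A, H, T = 10, 5, 20, 1_000_000
    print("regret order:", minimax_regret_order(S, A, H, T))
    print("burn-in ratio in S and A:", burn_in_ratio(S, A))
```

Under these assumptions, even a small problem with $S = 10$ and $A = 5$ gives a ratio of $10^5 \cdot 5^3$, which is the $S^5A^3$ gap the abstract refers to.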

Breaking the sample complexity barrier to regret-optimal model-free reinforcement learning
