Abstract: Achieving sample efficiency in online episodic reinforcement learning (RL) requires optimally balancing exploration and exploitation. When it comes to a finite-horizon episodic Markov decision process with $S$ states, $A$ actions and horizon length $H$, substantial progress has been achieved toward characterizing the minimax-optimal regret, which scales on the order of $\sqrt{H^2SAT}$ (modulo log factors) with $T$ the total number of samples. While several competing solution paradigms have been proposed to minimize regret, they are either memory-inefficient, or fall short of optimality unless the sample size exceeds an enormous threshold (e.g. $S^6A^4 \,\mathrm{poly}(H)$ for existing model-free methods). To overcome such a large sample size barrier to efficient RL, we design a novel model-free algorithm, with space complexity $O(SAH)$, that achieves near-optimal regret as soon as the sample size exceeds the order of $SA\,\mathrm{poly}(H)$. In terms of this sample size requirement (also referred to as the initial burn-in cost), our method improves, by at least a factor of $S^5A^3$, upon any prior memory-efficient algorithm that is asymptotically regret-optimal. Leveraging the recently introduced variance reduction strategy (also called reference-advantage decomposition), the proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences with upper and lower confidence bounds. The design principle of our early-settled variance reduction method might be of independent interest to other RL settings that involve intricate exploration–exploitation trade-offs.
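
To make the scalings quoted in the abstract concrete, the following minimal sketch evaluates the regret order $\sqrt{H^2SAT}$ and the ratio between the two burn-in thresholds. It is only an illustration of the arithmetic, not the paper's algorithm: the function names are hypothetical, constants and log factors are dropped, and the $\mathrm{poly}(H)$ factors are assumed to be identical on both sides so that they cancel (the abstract only claims an improvement of "at least" $S^5A^3$).

```python
import math


def minimax_regret_order(S: int, A: int, H: int, T: int) -> float:
    """Order of the minimax-optimal regret, sqrt(H^2 * S * A * T).

    Constants and log factors are ignored (hypothetical helper).
    """
    return math.sqrt(H**2 * S * A * T)


def burn_in_improvement(S: int, A: int) -> int:
    """Ratio of the prior model-free threshold S^6 A^4 poly(H) to S A poly(H).

    Assumption: both poly(H) factors are the same and cancel.
    """
    return (S**6 * A**4) // (S * A)


if __name__ == "__main__":
    S, A, H, T = 10, 5, 20, 1_000_000  # illustrative values only
    print("regret order:", minimax_regret_order(S, A, H, T))
    print("burn-in improvement factor:", burn_in_improvement(S, A))
```

Under these assumptions the improvement factor equals $S^5A^3$ exactly, which grows quickly with the number of states: doubling $S$ multiplies it by 32.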

Regret Minimization For Reinforcement Learning By Evaluating The Optimal Bias Function

Horizon-Free and Variance-Dependent Reinforcement Learning for Latent Markov Decision Processes

Achieving Tractable Minimax Optimal Regret in Average Reward MDPs

Settling Constant Regrets in Linear Markov Decision Processes

Provably Efficient Reinforcement Learning for Infinite-Horizon Average-Reward Linear MDPs

Improved Regret Bound for Safe Reinforcement Learning via Tighter Cost Pessimism and Reward Optimism

Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning

Horizon-Free Regret for Linear Markov Decision Processes

Breaking the sample complexity barrier to regret-optimal model-free reinforcement learning

Horizon-Free and Instance-Dependent Regret Bounds for Reinforcement Learning with General Function Approximation

$\sqrt{n}$-Regret for Learning in Markov Decision Processes with Function Approximation and Low Bellman Rank

Sharper Model-free Reinforcement Learning for Average-reward Markov Decision Processes

Almost Optimal Model-Free Reinforcement Learning Via Reference-Advantage Decomposition

Regret Minimization for Partially Observable Deep Reinforcement Learning

Fundamental Limits of Reinforcement Learning in Environment with Endogeneous and Exogeneous Uncertainty

Bridging Distributional and Risk-sensitive Reinforcement Learning with Provable Regret Bounds

Optimistic Posterior Sampling for Reinforcement Learning: Worst-Case Regret Bounds

Regret-Optimal Model-Free Reinforcement Learning for Discounted MDPs with Short Burn-In Time

Near-Optimal Regret in Linear MDPs with Aggregate Bandit Feedback

Online Reinforcement Learning in Markov Decision Process Using Linear Programming