Settling the Sample Complexity of Online Reinforcement Learning

Zihan Zhang, Yuxin Chen, Jason D. Lee, Simon S. Du
2024-05-24
Abstract: A central issue lying at the heart of online reinforcement learning (RL) is data efficiency. While a number of recent works achieved asymptotically minimal regret in online RL, the optimality of these results is only guaranteed in a "large-sample" regime, imposing enormous burn-in cost in order for their algorithms to operate optimally. How to achieve minimax-optimal regret without incurring any burn-in cost has been an open problem in RL theory.
Machine Learning
What problem does this paper attempt to address?
The paper addresses the sample complexity of online reinforcement learning (RL), and in particular how to achieve minimax-optimal regret without any burn-in cost. For finite-horizon inhomogeneous Markov decision processes (MDPs) with \(S\) states, \(A\) actions, and horizon \(H\), it proposes a modified Monotonic Value Propagation (MVP) algorithm and proves that it attains the optimal regret bound for every number of episodes \(K \geq 1\).

### Main contributions of the paper

1. **Determining the optimal sample complexity**:
   - The paper proves that the modified MVP algorithm achieves the optimal regret bound without any burn-in cost. For any \(K \geq 1\), its regret satisfies
     \[ \text{Regret}(K) \lesssim \min\left\{\sqrt{SAH^3K \log^5(SAHK/\delta)}, HK\right\} \]
     This matches the known minimax lower bound up to logarithmic factors over the entire range of \(K\), so no burn-in period is required.
2. **Extension to problem-dependent regret bounds**:
   - The paper further studies how the optimal value, the optimal cost, and certain variance quantities affect the regret. For the optimal-value-dependent bound, it shows
     \[ E[\text{Regret}(K)] \lesssim \min\left\{\sqrt{SAH^2Kv^\star}, Kv^\star\right\} \log^5(SAHK) \]
     where \(v^\star\) is the value of the optimal policy averaged over the initial state distribution.
   - For the optimal-cost-dependent bound, it shows
     \[ \text{Regret}(K) \leq \tilde{O}\left(\min\left\{\sqrt{SAH^2Kc^\star} + SAH^2, K(H - c^\star)\right\}\right) \]
     where \(c^\star\) is the cost of the optimal policy averaged over the initial state distribution.
   - For the variance-dependent bound, it shows
     \[ \text{Regret}(K) \leq \tilde{O}\left(\min\left\{\sqrt{SAHK \, \text{var}} + SAH^2, KH\right\}\right) \]
     where \(\text{var}\) denotes a variance-type quantity defined in the paper.
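To get a feel for what holding "for every \(K \geq 1\)" means, the following is a minimal sketch that evaluates the order of the first bound above, \(\min\{\sqrt{SAH^3K}, HK\}\). It is an illustration under stated assumptions, not the paper's algorithm: the function names `regret_rate` and `crossover_episodes` are hypothetical, and all constants and logarithmic factors are dropped, so the numbers indicate orders of magnitude only. Ignoring logs, \(\sqrt{SAH^3K} \leq HK\) exactly when \(K \geq SAH\), so the trivial bound \(HK\) is the active term for fewer episodes than that.

```python
import math


def regret_rate(S: int, A: int, H: int, K: int) -> float:
    """Order of min{sqrt(S*A*H^3*K), H*K}.

    Assumption: constants and log factors are dropped, so this is a
    rate, not an actual regret guarantee.
    """
    return min(math.sqrt(S * A * H**3 * K), H * K)


def crossover_episodes(S: int, A: int, H: int) -> int:
    """Episode count at which the two terms coincide (ignoring logs).

    sqrt(S*A*H^3*K) <= H*K  <=>  S*A*H^3*K <= H^2*K^2  <=>  K >= S*A*H.
    """
    return S * A * H


if __name__ == "__main__":
    S, A, H = 10, 5, 20  # hypothetical problem sizes
    print("crossover K =", crossover_episodes(S, A, H))
    for K in (100, 1_000, 10_000):
        print(K, regret_rate(S, A, H, K))
```

With these hypothetical sizes, the linear term \(HK\) is the smaller one below \(K = SAH\), and the square-root term takes over beyond it.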
### Key technical innovation

The key technical tool is a new analysis paradigm built on the notion of a "profile", which decouples the complicated statistical dependencies that arise in the sample-starved regime, a long-standing challenge in online RL. With this approach, the authors can handle the most sample-starved regime (i.e., the case where the sample size scales only linearly with the number of states \(S\)) and thereby avoid any burn-in cost.

### Conclusion

Overall, the paper resolves a long-standing open problem in online reinforcement learning: how to achieve optimal sample complexity without any burn-in cost. Beyond its theoretical significance, the result offers new ideas for using data more efficiently in practical applications.