Abstract: Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP), based on a single trajectory of Markovian samples induced by a behavior policy. Focusing on a $\gamma$-discounted MDP with state space $\mathcal{S}$ and action space $\mathcal{A}$, we demonstrate that the $\ell_{\infty}$-based sample complexity of classical asynchronous Q-learning (namely, the number of samples needed to yield an entrywise $\varepsilon$-accurate estimate of the Q-function) is at most on the order of $\frac{1}{\mu_{\mathsf{min}}(1-\gamma)^{5}\varepsilon^{2}} + \frac{t_{\mathsf{mix}}}{\mu_{\mathsf{min}}(1-\gamma)}$ up to some logarithmic factor, provided that a proper constant learning rate is adopted. Here, $t_{\mathsf{mix}}$ and $\mu_{\mathsf{min}}$ denote respectively the mixing time and the minimum state-action occupancy probability of the sample trajectory. The first term of this bound matches the sample complexity in the synchronous case with independent samples drawn from the stationary distribution of the trajectory. The second term reflects the cost taken for the empirical distribution of the Markovian trajectory to reach a steady state, which is incurred at the very beginning and becomes amortized as the algorithm runs. Encouragingly, the above bound improves upon the state-of-the-art result by a factor of at least $|\mathcal{S}||\mathcal{A}|$ for all scenarios, and by a factor of at least $t_{\mathsf{mix}}|\mathcal{S}||\mathcal{A}|$ for any sufficiently small accuracy level $\varepsilon$. Further, we demonstrate that the scaling on the effective horizon $\frac{1}{1-\gamma}$ can be improved by means of variance reduction.
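To make the setting concrete, the following is a minimal sketch of classical asynchronous Q-learning with a constant learning rate, run along a single Markovian trajectory. Everything beyond the update rule is an assumption made for illustration: the toy random MDP, the uniform behavior policy, the learning rate of 0.01, and the helper names (`make_mdp`, `optimal_q`, `async_q_learning`, `linf_error`, `bound_terms`) are hypothetical and do not come from the paper. The function `bound_terms` simply evaluates the two terms of the bound stated above, ignoring constants and logarithmic factors.

```python
import random


def make_mdp(n_states, n_actions, rng):
    """Build a hypothetical random tabular MDP (illustrative only).

    P[s][a] is a probability vector over next states; R[s][a] lies in [0, 1].
    """
    P, R = [], []
    for _ in range(n_states):
        p_row, r_row = [], []
        for _ in range(n_actions):
            weights = [rng.random() + 0.1 for _ in range(n_states)]
            total = sum(weights)
            p_row.append([w / total for w in weights])
            r_row.append(rng.random())
        P.append(p_row)
        R.append(r_row)
    return P, R


def optimal_q(P, R, gamma, iters=500):
    """Reference Q* from value iteration; uses the model, for checking only."""
    n_s, n_a = len(P), len(P[0])
    Q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(iters):
        V = [max(row) for row in Q]
        Q = [[R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))
              for a in range(n_a)] for s in range(n_s)]
    return Q


def async_q_learning(P, R, gamma, eta, T, rng):
    """Asynchronous Q-learning on one trajectory with constant learning rate eta."""
    n_s, n_a = len(P), len(P[0])
    Q = [[0.0] * n_a for _ in range(n_s)]
    s = 0
    for _ in range(T):
        a = rng.randrange(n_a)  # uniform behavior policy (an assumption)
        s_next = rng.choices(range(n_s), weights=P[s][a])[0]
        target = R[s][a] + gamma * max(Q[s_next])
        # Only the visited (s, a) entry is updated at each step.
        Q[s][a] = (1.0 - eta) * Q[s][a] + eta * target
        s = s_next
    return Q


def linf_error(Q, Q_ref):
    """Entrywise (l-infinity) distance between two Q tables."""
    return max(abs(q - r) for row, ref in zip(Q, Q_ref) for q, r in zip(row, ref))


def bound_terms(mu_min, gamma, eps, t_mix):
    """The two terms of the stated bound, ignoring constants and log factors."""
    first = 1.0 / (mu_min * (1.0 - gamma) ** 5 * eps ** 2)
    second = t_mix / (mu_min * (1.0 - gamma))
    return first, second


rng = random.Random(0)
P, R = make_mdp(4, 2, rng)
Q_star = optimal_q(P, R, gamma=0.8)
Q_hat = async_q_learning(P, R, gamma=0.8, eta=0.01, T=100_000, rng=rng)
print("entrywise error:", linf_error(Q_hat, Q_star))
print("bound terms:", bound_terms(mu_min=0.1, gamma=0.9, eps=0.1, t_mix=10))
```

The sketch updates a single state-action entry per step, which is what distinguishes the asynchronous setting from the synchronous one, where every entry is refreshed with independent samples in each iteration. With a constant learning rate the estimate keeps fluctuating around the optimal Q-function rather than converging exactly, so the printed error should be read as a rough indication only.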

Finite-Time Analysis of Asynchronous Q-Learning Under Diminishing Step-Size From Control-Theoretic View

Asynchronous Finite-Time $H_{\infty}$ Control for a Class of Nonlinear Switched Time-Delay Systems

Unified ODE Analysis of Smooth Q-Learning Algorithms

Finite-Time Error Analysis of Soft Q-Learning: Switching System Approach

Final Iteration Convergence Bound of Q-Learning: Switching System Approach

Finite-Time Analysis of Minimax Q-Learning for Two-Player Zero-Sum Markov Games: Switching System Approach

A Lyapunov Theory for Finite-Sample Guarantees of Asynchronous Q-Learning and TD-Learning Variants

Finite-Time Analysis of Simultaneous Double Q-learning

Deep Q-Learning: Theoretical Insights from an Asymptotic Analysis

Constant Stepsize Q-learning: Distributional Convergence, Bias and Extrapolation

Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and Variance Reduction

Finite-Time Analysis of Temporal Difference Learning: Discrete-Time Linear System Perspective

Finite-Time Error Analysis of Online Model-Based Q-Learning with a Relaxed Sampling Model

A Q-Learning Algorithm for Discrete-Time Linear-Quadratic Control with Random Parameters of Unknown Distribution: Convergence and Stabilization

Neural Q-learning for discrete-time nonlinear zero-sum games with adjustable convergence rate

Stochastic LQ optimal control for Markov jumping systems with multiplicative noise using reinforcement learning

Non-asymptotic Convergence of Adam-type Reinforcement Learning Algorithms under Markovian Sampling

Deep Q-Learning with Low Switching Cost

Suboptimality analysis of receding horizon quadratic control with unknown linear systems and its applications in learning-based control

An Improved Finite-time Analysis of Temporal Difference Learning with Deep Neural Networks