Abstract: For a reinforcement learning method to be useful, the policy it estimates in the limit must be superior to the initial guess, at least on average. In this work, we show that the widely used Deep Q-Network (DQN) fails to meet even this basic criterion, even when it gets to see all possible states and actions infinitely often (a condition that ensures tabular Q-learning's convergence to the optimal Q-value). Our work's key highlights are as follows. First, we numerically show that DQN generally has a non-trivial probability of producing a policy worse than the initial one. Second, we give a theoretical explanation for this behavior in the context of linear DQN, wherein we replace the neural network with a linear function approximation but retain DQN's other key ideas, such as experience replay, target network, and $\epsilon$-greedy exploration. Our main result is that the tail behaviors of linear DQN are governed by invariant sets of a deterministic differential inclusion, a set-valued generalization of a differential equation. Notably, we show that these invariant sets need not align with locally optimal policies, thus explaining DQN's pathological behaviors, such as convergence to sub-optimal policies and policy oscillation. We also provide a scenario where the limiting policy is always the worst. Our work addresses a longstanding gap in understanding the behaviors of Q-learning with function approximation and $\epsilon$-greedy exploration.
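
To make the linear DQN setting in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It keeps the three ingredients the abstract names (experience replay, a target network, and $\epsilon$-greedy exploration) and replaces the neural network with a linear Q-function. The toy two-state MDP, the one-hot features `phi`, the dynamics `env_step`, and all hyperparameter values are assumptions chosen only for illustration.

```python
import random

GAMMA = 0.9  # discount factor (assumed value)

def phi(s, a):
    """Hypothetical one-hot features for a toy MDP with 2 states and 2 actions."""
    v = [0.0] * 4
    v[2 * s + a] = 1.0
    return v

def env_step(s, a):
    """Assumed toy dynamics: action 1 switches state; reward 1 for landing in state 1."""
    s_next = 1 - s if a == 1 else s
    return s_next, (1.0 if s_next == 1 else 0.0)

def q(theta, s, a):
    """Linear Q-value: inner product of the weights with the features."""
    return sum(w * f for w, f in zip(theta, phi(s, a)))

def linear_dqn(n_steps=2000, eps=0.1, alpha=0.05, batch=8, sync_every=50, seed=0):
    rng = random.Random(seed)
    theta = [0.0] * 4      # online weights
    target = list(theta)   # target-network weights, synced periodically
    replay = []            # experience replay buffer (unbounded in this sketch)
    s = 0
    for t in range(1, n_steps + 1):
        # epsilon-greedy action selection with respect to the online weights
        if rng.random() < eps:
            a = rng.randrange(2)
        else:
            a = max((0, 1), key=lambda b: q(theta, s, b))
        s_next, r = env_step(s, a)
        replay.append((s, a, r, s_next))
        # semi-gradient updates on a minibatch sampled uniformly from the buffer
        for bs, ba, br, bs2 in rng.sample(replay, min(batch, len(replay))):
            y = br + GAMMA * max(q(target, bs2, b) for b in (0, 1))
            delta = y - q(theta, bs, ba)
            theta = [w + alpha * delta * f for w, f in zip(theta, phi(bs, ba))]
        # copy the online weights into the target network every sync_every steps
        if t % sync_every == 0:
            target = list(theta)
        s = s_next
    return theta

def greedy_policy(theta):
    """Greedy action in each state under the learned weights."""
    return [max((0, 1), key=lambda b: q(theta, s, b)) for s in (0, 1)]
```

Calling `greedy_policy(linear_dqn())` returns one action per state. With one-hot features the sketch reduces to the tabular case, so it does not reproduce the pathologies the abstract describes; those concern general linear features, which would require swapping in a different `phi`.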

Q-learning as a monotone scheme

Gradient Q(σ, λ): A Unified Algorithm with Function Approximation for Reinforcement Learning

Unified ODE Analysis of Smooth Q-Learning Algorithms

Stability of Q-Learning Through Design and Optimism

Deep Q-Learning: Theoretical Insights from an Asymptotic Analysis

Finite-Time Analysis of Asynchronous Q-Learning Under Diminishing Step-Size From Control-Theoretic View

Convex Q Learning in a Stochastic Environment: Extended Version

A Nearly Optimal and Low-Switching Algorithm for Reinforcement Learning with General Function Approximation

Final Iteration Convergence Bound of Q-Learning: Switching System Approach

Inverse Value Iteration and Q-Learning: Algorithms, Stability, and Robustness

Does DQN Learn?

Constant Stepsize Q-learning: Distributional Convergence, Bias and Extrapolation

Beyond Strict Competition: Approximate Convergence of Multi Agent Q-Learning Dynamics

Continuous-time q-learning for mean-field control problems

Safe Q-learning for continuous-time linear systems

Sufficient Exploration for Convex Q-learning

Sublinear Regret for a Class of Continuous-Time Linear-Quadratic Reinforcement Learning Problems

Neural Q-learning for discrete-time nonlinear zero-sum games with adjustable convergence rate

Easy Monotonic Policy Iteration