Abstract: Temporal difference (TD) learning algorithms with neural network function parameterization have well-established empirical success in many practical large-scale reinforcement learning tasks. However, theoretical understanding of these algorithms remains challenging due to the nonlinearity of the action-value approximation. In this paper, we develop an improved non-asymptotic analysis of the neural TD method with a general $L$-layer neural network. New proof techniques are developed and an improved $\tilde{\mathcal{O}}(\epsilon^{-1})$ sample complexity is derived. To the best of our knowledge, this is the first finite-time analysis of neural TD that achieves an $\tilde{\mathcal{O}}(\epsilon^{-1})$ complexity under Markovian sampling, as opposed to the best known $\tilde{\mathcal{O}}(\epsilon^{-2})$ complexity in the existing literature.
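
For readers who want a concrete picture of the setting, the following is a minimal sketch of semi-gradient TD(0) with a neural network along a single Markovian trajectory. It is not the paper's algorithm: it assumes one hidden layer with ELU activation instead of a general $L$-layer network, omits any projection step, and treats each index as a state-action pair under a fixed policy. The names `neural_td`, `q_value`, `features`, `P`, and `rewards` are hypothetical and chosen for illustration only.

```python
import math
import random


def elu(z):
    """ELU activation: z for z > 0, exp(z) - 1 otherwise."""
    return z if z > 0 else math.exp(z) - 1.0


def elu_grad(z):
    """Derivative of ELU with respect to its input."""
    return 1.0 if z > 0 else math.exp(z)


def q_value(W, v, x):
    """Return Q(x) = v . elu(W x) / sqrt(width) and the pre-activations W x."""
    z = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]
    q = sum(v_i * elu(z_i) for v_i, z_i in zip(v, z)) / math.sqrt(len(v))
    return q, z


def neural_td(features, P, rewards, gamma=0.9, width=32,
              steps=2000, lr=0.01, seed=0):
    """Semi-gradient TD(0) on one trajectory of the chain with transition matrix P.

    Assumption: index s stands for a state-action pair under a fixed policy,
    so P already folds the policy into the transitions.
    """
    rng = random.Random(seed)
    d = len(features[0])
    W = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(width)]
    v = [rng.gauss(0, 1) for _ in range(width)]
    scale = math.sqrt(width)
    s = 0
    for _ in range(steps):
        # Markovian sampling: the next sample depends on the current one.
        s_next = rng.choices(range(len(P)), weights=P[s])[0]
        x = features[s]
        q, z = q_value(W, v, x)
        q_next, _ = q_value(W, v, features[s_next])
        # TD error; the bootstrapped target is treated as a constant.
        delta = q - (rewards[s] + gamma * q_next)
        for i in range(width):
            grad_v = elu(z[i]) / scale               # dQ/dv_i
            coef = v[i] * elu_grad(z[i]) / scale     # dQ/dz_i
            for j in range(d):
                W[i][j] -= lr * delta * coef * x[j]  # dQ/dW_ij = coef * x_j
            v[i] -= lr * delta * grad_v
        s = s_next
    return W, v


if __name__ == "__main__":
    # Toy 3-state chain with one-hot features (illustrative values only).
    features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    P = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]]
    rewards = [1.0, 0.0, -1.0]
    W, v = neural_td(features, P, rewards)
    for s, x in enumerate(features):
        print(s, q_value(W, v, x)[0])
```

Because consecutive samples come from the same trajectory, they are correlated, which is the non-i.i.d. difficulty that the finite-time analysis has to handle.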

What problem does this paper attempt to address?

The paper aims to improve the sample complexity of temporal difference (TD) learning with deep neural networks. Existing analyses of neural TD and neural Q-learning only establish a sample complexity of $\tilde{O}(\epsilon^{-2})$ under various settings, whereas a sample complexity of $\tilde{O}(\epsilon^{-1})$ should be expected in theory. To close this gap, the paper re-examines the convergence analysis of neural TD learning and Q-learning, where the Q-function is parameterized by a general $L$-layer neural network, under the non-i.i.d. (Markovian) sampling setting. By proposing a new subspace analysis technique, the paper derives, under appropriate conditions, a $\tilde{O}(\epsilon^{-1})$ sample complexity for neural TD learning and Q-learning, improving the best $\tilde{O}(\epsilon^{-2})$ sample complexity in the existing literature.

### Main contributions of the paper:

1. **Improvement of sample complexity**: Under the non-i.i.d. (Markovian) sampling setting, the paper derives a $\tilde{O}(\epsilon^{-1})$ sample complexity for neural TD learning and Q-learning methods that parameterize the Q-function with a multi-layer neural network, improving the $\tilde{O}(\epsilon^{-2})$ sample complexity in the existing literature.
2. **Extension to two-player zero-sum Markov games**: Based on the newly developed techniques, the paper provides a finite-sample analysis of the minimax neural Q-learning algorithm for solving two-player zero-sum Markov games and obtains a $\tilde{O}(\epsilon^{-1})$ sample complexity under the non-i.i.d. (Markovian) sampling setting.
3. **Technical contributions**: The proposed subspace analysis method is of independent interest. It can be applied to the linear Q-learning algorithm and the linear Actor-Critic algorithm without assuming that the feature covariance matrix is positive definite, while maintaining a complexity of $\tilde{O}(\epsilon^{-1})$.

### Comparison of sample complexity:

| Method | Network depth | Network width | Activation function | Sample complexity |
| ------ | ------ | ------ | ------ | ------ |
| Bhandari et al. (2018) | None | None | None | $O(1/\epsilon)$ |
| Cai et al. (2023) | 2 | $\Omega(1/\epsilon^4)$ | ReLU | $O(1/\epsilon^2)$ |
| Xu & Gu (2020) | $L$ | $\Omega(1/\epsilon^6)$ | ReLU | $O(1/\epsilon^2)$ |
| Sun et al. (2022) | $L$ | $\Omega(1/\epsilon^6)$ | ReLU | $O(1/\epsilon^{2/(2 - \alpha)})$, $\alpha\in(0,1]$ |
| Tian et al. (2022) | $L$ | $\Omega(1/\epsilon^2)$ | ELU, GeLU | $O(1/\epsilon^2)$ |
| This paper | $L$ | $\Omega(1/\epsilon^2)$ | ELU, GeLU | $O(1/\epsilon)$ |

### Conclusion:

By proposing a new subspace analysis technique, the paper improves the sample complexity of neural TD learning and Q-learning from $\tilde{O}(\epsilon^{-2})$ to $\tilde{O}(\epsilon^{-1})$.
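
To get a feel for the gap between the $\tilde{O}(\epsilon^{-2})$ and $\tilde{O}(\epsilon^{-1})$ rates in the comparison above, the following back-of-the-envelope sketch evaluates both for a few target accuracies. It assumes a constant of 1 and ignores the logarithmic factors hidden by the tilde, so the numbers only indicate orders of magnitude; `samples_needed` is a hypothetical helper, not something defined in the paper.

```python
def samples_needed(eps, exponent, constant=1.0):
    """Rough sample count constant * eps**(-exponent); log factors are ignored."""
    if eps <= 0:
        raise ValueError("eps must be positive")
    return constant * eps ** (-exponent)


if __name__ == "__main__":
    for eps in (0.1, 0.01, 0.001):
        old = samples_needed(eps, 2)  # eps^-2 rate from earlier analyses
        new = samples_needed(eps, 1)  # eps^-1 rate claimed by this paper
        print(f"eps={eps}: eps^-2 ~ {old:.0f}, eps^-1 ~ {new:.0f}, "
              f"ratio ~ {old / new:.0f}")
```

Under these simplifying assumptions the ratio between the two counts is $1/\epsilon$, so the saving grows as the target accuracy tightens.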

An Improved Finite-time Analysis of Temporal Difference Learning with Deep Neural Networks

Finite-Time Analysis of Adaptive Temporal Difference Learning with Deep Neural Networks

Finite-Time Analysis of Temporal Difference Learning: Discrete-Time Linear System Perspective

Finite-Time Bounds for AMSGrad-Enhanced Neural TD

Finite-Sample Analysis of Decentralized Temporal-Difference Learning with Linear Function Approximation

Reanalysis of Variance Reduced Temporal Difference Learning

Statistical Inference for Temporal Difference Learning with Linear Function Approximation

A Non-asymptotic Analysis of Non-parametric Temporal-Difference Learning

Temporal Difference Learning with Experience Replay

A Simple Finite-Time Analysis of TD Learning with Linear Function Approximation

Decentralized Adaptive TD $(\lambda)$ Learning with Linear Function Approximation: Nonasymptotic Analysis

On the Statistical Benefits of Temporal Difference Learning

Provable distributed adaptive temporal-difference learning over time-varying networks

Target-Based Temporal Difference Learning

Almost Sure Convergence of Linear Temporal Difference Learning with Arbitrary Features

Finite time analysis of temporal difference learning with linear function approximation: Tail averaging and regularisation

Improved High-Probability Bounds for the Temporal Difference Learning Algorithm via Exponential Stability

Differentially Private Temporal Difference Learning with Stochastic Nonconvex-Strongly-Concave Optimization

Near Minimax-Optimal Distributional Temporal Difference Algorithms and The Freedman Inequality in Hilbert Spaces

Why Target Networks Stabilise Temporal Difference Methods

Almost Sure Convergence of Average Reward Temporal Difference Learning