An Improved Finite-time Analysis of Temporal Difference Learning with Deep Neural Networks

Zhifa Ke, Zaiwen Wen, Junyu Zhang
2024-05-07
Abstract: Temporal difference (TD) learning algorithms with neural network function parameterization have well-established empirical success in many practical large-scale reinforcement learning tasks. However, theoretical understanding of these algorithms remains challenging due to the nonlinearity of the action-value approximation. In this paper, we develop an improved non-asymptotic analysis of the neural TD method with a general $L$-layer neural network. New proof techniques are developed and an improved $\tilde{\mathcal{O}}(\epsilon^{-1})$ sample complexity is derived. To the best of our knowledge, this is the first finite-time analysis of neural TD that achieves an $\tilde{\mathcal{O}}(\epsilon^{-1})$ complexity under Markovian sampling, as opposed to the best known $\tilde{\mathcal{O}}(\epsilon^{-2})$ complexity in the existing literature.
Machine Learning, Artificial Intelligence, Optimization and Control
What problem does this paper attempt to address?
This paper aims to improve the sample complexity guarantees for temporal difference (TD) learning with deep neural networks. Existing analyses of neural TD and neural Q-learning provide only an \(\tilde{O}(\epsilon^{-2})\) sample complexity under various settings, whereas an \(\tilde{O}(\epsilon^{-1})\) sample complexity should be expected in theory. To close this gap, the paper re-examines the convergence analysis of neural TD learning and Q-learning, where the Q-function is parameterized by a general \(L\)-layer neural network, under the non-i.i.d. (Markovian) sampling setting. Using a new subspace analysis technique, and under appropriate conditions, the paper derives an \(\tilde{O}(\epsilon^{-1})\) sample complexity for neural TD learning and Q-learning, improving on the best \(\tilde{O}(\epsilon^{-2})\) sample complexity in the existing literature.

### Main contributions of the paper:

1. **Improvement of sample complexity**: Under the non-i.i.d. (Markovian) sampling setting, the paper derives an \(\tilde{O}(\epsilon^{-1})\) sample complexity for neural TD learning and Q-learning methods that parameterize the Q-function with a multi-layer neural network, improving on the \(\tilde{O}(\epsilon^{-2})\) sample complexity in the existing literature.
2. **Extension to two-player zero-sum Markov games**: Based on the newly developed techniques, the paper provides a finite-sample analysis of the minimax neural Q-learning algorithm for solving two-player zero-sum Markov games and obtains an \(\tilde{O}(\epsilon^{-1})\) sample complexity under the non-i.i.d. (Markovian) sampling setting.
3. **Technical contributions**: The proposed subspace analysis method is of independent interest. It can be applied to the linear Q-learning algorithm and the linear Actor-Critic algorithm without assuming that the feature covariance matrix is positive definite, while maintaining a complexity of \(\tilde{O}(\epsilon^{-1})\).

### Comparison of sample complexity:

| Method | Network depth | Network width | Activation function | Sample complexity |
| ------ | ------ | ------ | ------ | ------ |
| Bhandari et al. (2018) | None | None | None | \(O(1/\epsilon)\) |
| Cai et al. (2023) | 2 | \(\Omega(1/\epsilon^4)\) | ReLU | \(O(1/\epsilon^2)\) |
| Xu & Gu (2020) | L | \(\Omega(1/\epsilon^6)\) | ReLU | \(O(1/\epsilon^2)\) |
| Sun et al. (2022) | L | \(\Omega(1/\epsilon^6)\) | ReLU | \(O(1/\epsilon^{2/(2 - \alpha)})\), \(\alpha\in(0,1]\) |
| Tian et al. (2022) | L | \(\Omega(1/\epsilon^2)\) | ELU, GeLU | \(O(1/\epsilon^2)\) |
| This paper | L | \(\Omega(1/\epsilon^2)\) | ELU, GeLU | \(O(1/\epsilon)\) |

### Conclusion:

By proposing a new subspace analysis technique, the paper improves the sample complexity of neural TD learning and Q-learning from \(\tilde{O}(\epsilon^{-2})\) to \(\tilde{O}(\epsilon^{-1})\).
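
### Illustrative sketch: a neural TD update

To make the setting concrete, here is a minimal sketch of a semi-gradient TD(0) update for a Q-function parameterized by a multi-layer network with ELU activations, run along a single trajectory so that consecutive samples are Markovian rather than i.i.d. It is not the paper's algorithm: the toy cyclic chain, the one-hot features, the uniform random policy, the layer widths, the step size, and all function names are assumptions chosen for illustration, and any projection step or initialization scheme the analysis may rely on is omitted.

```python
import numpy as np

def elu(z):
    # Smooth activation (ELU is one of the activations listed in the table).
    return np.where(z > 0, z, np.exp(np.minimum(z, 0.0)) - 1.0)

def elu_grad(z):
    return np.where(z > 0, 1.0, np.exp(np.minimum(z, 0.0)))

def init_params(sizes, rng):
    # Gaussian weights scaled by 1/sqrt(fan_in); an illustrative choice.
    return [rng.standard_normal((m, n)) / np.sqrt(n)
            for n, m in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    # Returns the scalar Q(x) plus cached activations and pre-activations.
    acts, pres = [x], []
    h = x
    for W in params[:-1]:
        z = W @ h
        pres.append(z)
        h = elu(z)
        acts.append(h)
    q = float((params[-1] @ h)[0])
    return q, acts, pres

def grad_q(params, acts, pres):
    # Backpropagation: gradient of the scalar Q w.r.t. every weight matrix.
    grads = [None] * len(params)
    grads[-1] = acts[-1][None, :]
    delta = params[-1][0]
    for l in range(len(params) - 2, -1, -1):
        dz = delta * elu_grad(pres[l])
        grads[l] = np.outer(dz, acts[l])
        delta = params[l].T @ dz
    return grads

def td_step(params, feat, reward, next_feat, gamma, lr):
    # Semi-gradient TD(0): the bootstrapped target is treated as a constant.
    q, acts, pres = forward(params, feat)
    q_next, _, _ = forward(params, next_feat)
    td_error = reward + gamma * q_next - q
    grads = grad_q(params, acts, pres)
    new_params = [W + lr * td_error * g for W, g in zip(params, grads)]
    return new_params, td_error

def features(s, a, n_states, n_actions):
    # One-hot encoding of the (state, action) pair (hypothetical feature map).
    x = np.zeros(n_states * n_actions)
    x[s * n_actions + a] = 1.0
    return x

def run(num_steps=2000, seed=0, n_states=5, n_actions=2, gamma=0.9, lr=0.02):
    rng = np.random.default_rng(seed)
    params = init_params([n_states * n_actions, 16, 16, 1], rng)
    s, a = 0, int(rng.integers(n_actions))
    abs_errors = []
    for _ in range(num_steps):
        # Toy cyclic chain: action 1 moves right, action 0 moves left.
        s_next = (s + (1 if a == 1 else -1)) % n_states
        r = 1.0 if s_next == 0 else 0.0
        a_next = int(rng.integers(n_actions))  # uniform random policy
        params, td_error = td_step(
            params,
            features(s, a, n_states, n_actions),
            r,
            features(s_next, a_next, n_states, n_actions),
            gamma,
            lr,
        )
        abs_errors.append(abs(td_error))
        s, a = s_next, a_next  # continue the same trajectory (Markovian data)
    return params, abs_errors
```

Calling `run()` returns the final weights and the absolute TD error at each step. In this picture, a sample complexity bound describes how many steps of such a trajectory suffice to reach accuracy \(\epsilon\); the sketch itself makes no claim about matching the rates in the table above.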