What problem does this paper attempt to address?

This paper addresses two problems in Deep Q-learning: inaccurate gradient estimation and excessive gradient variance. Although current deep reinforcement learning algorithms have achieved human-level performance in practical applications, the following challenges remain:

1. **Inaccurate gradient estimation**: Because of the randomness in the reinforcement learning training process, gradient estimates can deviate substantially from the true gradient.
2. **Excessive variance**: High variance in the gradient makes training unstable and sample-inefficient.

These problems can push model parameters away from their optimal settings, which degrades the performance of the Deep Q-Network (DQN) and increases the time needed to reach a local optimum.

To solve these problems, the paper proposes a new optimization strategy, **Stochastic Variance Reduction for Deep Q-learning (SVR-DQN)**. The method improves the accuracy of gradient estimation by introducing the Stochastic Variance Reduced Gradient (SVRG) technique, thereby accelerating convergence and improving sample efficiency.

### Specific problem description

In large-scale Deep Q-learning problems, the Q-value is represented by the Deep Q-Network, and the network parameters are adjusted by optimizing a loss function. The standard approach is gradient descent, but because computing the full-expectation gradient is expensive, the gradient of a mini-batch of samples is usually used instead. This substitution can lead to inaccurate gradient estimation and excessive variance, which affects model performance.
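The variance problem with mini-batch gradients can be illustrated on a toy problem. The sketch below is not taken from the paper: it assumes a one-parameter least-squares loss as a stand-in for the DQN loss, and the names `sample_grad`, `minibatch_grad`, `grad_variance`, and `make_data` are hypothetical. It measures how much mini-batch gradient estimates scatter at a fixed parameter value for two batch sizes.

```python
import random
import statistics


def sample_grad(theta, sample):
    """Gradient of 0.5 * (y - theta * x)**2 with respect to theta (toy loss, an assumption)."""
    x, y = sample
    return -(y - theta * x) * x


def minibatch_grad(theta, data, batch_size, rng):
    """Average gradient over a randomly drawn mini-batch."""
    batch = rng.sample(data, batch_size)
    return sum(sample_grad(theta, s) for s in batch) / batch_size


def grad_variance(theta, data, batch_size, trials=2000, seed=0):
    """Empirical variance of the mini-batch gradient estimate at a fixed theta."""
    rng = random.Random(seed)
    estimates = [minibatch_grad(theta, data, batch_size, rng) for _ in range(trials)]
    return statistics.pvariance(estimates)


def make_data(n=512, slope=3.0, noise=0.1, seed=1):
    """Synthetic (x, y) pairs with y = slope * x + Gaussian noise (illustrative only)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, slope * x + rng.gauss(0.0, noise)))
    return data


if __name__ == "__main__":
    data = make_data()
    for batch_size in (4, 64):
        print(batch_size, grad_variance(0.0, data, batch_size))
```

On this toy problem the smaller batch gives a noticeably noisier gradient estimate, which is the effect that variance reduction techniques try to counteract without paying for a full-data gradient at every step.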
To describe this problem more precisely, let the network parameters of DQN be \(\theta\). The core learning step minimizes the gap between the estimated maximum Q-value \(y(s, a)\) and the current Q-value \(Q(s, a; \theta)\):

\[ \hat{\theta} = \arg\min_{\theta} E\left[\|y(s, a) - Q(s, a; \theta)\|^2\right] \]

If the variance of the gradient estimate is large, more iterations are required for \(\theta\) to reach \(\hat{\theta}\), so a large gradient variance delays DQN in reaching a local optimum.

### Solution

For this reason, the paper proposes the SVR-DQN optimization method, which accelerates the convergence of Deep Q-learning by reducing the variance of the Approximate Gradient Estimation (AGE). SVR-DQN uses the Stochastic Variance Reduced Gradient (SVRG) technique and is implemented through the following steps:

1. **Construct training sample batches**: Extract a batch \(B_s\) from all training samples and fix it for the entire optimization process.
2. **Calculate anchor point gradients**: Calculate the average gradient over the samples in \(B_s\) and use it as the anchor point \(\tilde{\mu}_s\).
3. **Inner-loop variance reduction**: Reduce the variance using randomly selected mini-batches \(b_t\) and update the parameters according to the update rule.
4. **Combine with the Adam optimizer**: Pass the variance-reduced gradient to the Adam optimizer to further update the parameters.

Through these steps, SVR-DQN can reduce the gradient variance while maintaining high sample efficiency and fast convergence.

### Experimental results

The experimental results show that SVR-DQN significantly outperforms the baseline method in the Atari game environment, performs better in 18 games, and shows stronger learning ability in the initial training stage.
In addition, the sample efficiency of SVR-DQN is almost twice that of the original Double-DQN, and it shows more stable performance in multiple games.

In conclusion, this paper addresses inaccurate gradient estimation and excessive variance in Deep Q-learning by introducing the SVRG technique, which improves the convergence speed and sample efficiency of the model.
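The four steps listed under the solution can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the paper's implementation: a one-parameter least-squares loss stands in for the DQN loss, the correction term uses the standard SVRG form (mini-batch gradient at the current parameters, minus the same mini-batch's gradient at the anchor parameters, plus the anchor gradient), the mini-batches \(b_t\) are assumed to be drawn from \(B_s\), and all function names and hyperparameters are hypothetical.

```python
import random


def sample_grad(theta, sample):
    """Gradient of 0.5 * (y - theta * x)**2 with respect to theta (toy stand-in loss)."""
    x, y = sample
    return -(y - theta * x) * x


def mean_grad(theta, batch):
    """Average per-sample gradient over a batch."""
    return sum(sample_grad(theta, s) for s in batch) / len(batch)


def svrg_gradient(theta, theta_anchor, mu_anchor, minibatch):
    """Standard SVRG estimate: g_b(theta) - g_b(theta_anchor) + mu_anchor."""
    return (mean_grad(theta, minibatch)
            - mean_grad(theta_anchor, minibatch)
            + mu_anchor)


def adam_step(theta, grad, state, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    """One scalar Adam update; the hyperparameters here are illustrative assumptions."""
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * grad
    state["v"] = b2 * state["v"] + (1 - b2) * grad * grad
    m_hat = state["m"] / (1 - b1 ** state["t"])
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return theta - lr * m_hat / (v_hat ** 0.5 + eps)


def train(data, outer_steps=20, inner_steps=20, anchor_size=64, mini_size=4, seed=0):
    rng = random.Random(seed)
    theta = 0.0
    state = {"t": 0, "m": 0.0, "v": 0.0}
    for _ in range(outer_steps):
        anchor_batch = rng.sample(data, anchor_size)        # step 1: batch B_s
        theta_anchor = theta
        mu_anchor = mean_grad(theta_anchor, anchor_batch)   # step 2: anchor gradient
        for _ in range(inner_steps):
            minibatch = rng.sample(anchor_batch, mini_size)  # step 3: mini-batch b_t
            grad = svrg_gradient(theta, theta_anchor, mu_anchor, minibatch)
            theta = adam_step(theta, grad, state)            # step 4: hand off to Adam
    return theta


def make_data(n=256, slope=3.0, noise=0.1, seed=1):
    """Synthetic (x, y) pairs with y = slope * x + Gaussian noise (illustrative only)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, slope * x + rng.gauss(0.0, noise)))
    return data


if __name__ == "__main__":
    print(train(make_data()))
```

When the current parameters equal the anchor parameters, the two mini-batch terms cancel and the estimate is exactly the anchor gradient, so the mini-batch noise only enters through how far the parameters have moved since the anchor was computed. In a real DQN setting the scalar `theta` would be the network weights and `sample_grad` the gradient of the temporal-difference loss.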

Stochastic Variance Reduction for Deep Q-learning
