Stochastic Variance Reduction for Deep Q-learning

Wei-Ye Zhao, Xi-Ya Guan, Yang Liu, Xiaoming Zhao, Jian Peng
DOI: https://doi.org/10.48550/arXiv.1905.08152
2019-05-20
Abstract: Recent advances in deep reinforcement learning have achieved human-level performance on a variety of real-world applications. However, the current algorithms still suffer from poor gradient estimation with excessive variance, resulting in unstable training and poor sample efficiency. In our paper, we proposed an innovative optimization strategy by utilizing stochastic variance reduced gradient (SVRG) techniques. With extensive experiments on Atari domain, our method outperforms the deep q-learning baselines on 18 out of 20 games.
Machine Learning
What problem does this paper attempt to address?
This paper addresses inaccurate gradient estimation and excessive gradient variance in Deep Q-learning. Although current deep reinforcement learning algorithms have achieved human-level performance in practical applications, two challenges remain:

1. **Inaccurate gradient estimation**: Because of the randomness in the reinforcement learning training process, gradient estimates can deviate substantially from the true gradient.
2. **Excessive variance**: High variance in the gradient makes training unstable and sample-inefficient.

These problems can push model parameters away from their optimal settings, which degrades the performance of the Deep Q-Network (DQN) and increases the time needed to reach a locally optimal solution.

To solve these problems, the authors propose a new optimization strategy, **Stochastic Variance Reduction for Deep Q-learning (SVR-DQN)**. The method improves the accuracy of gradient estimation by introducing the Stochastic Variance Reduced Gradient (SVRG) technique, thereby accelerating convergence and improving sample efficiency.

### Specific problem description

In large-scale Deep Q-learning problems, the Q-value is represented by the Deep Q-Network, and the network parameters are adjusted by optimizing a loss function. The standard approach is gradient descent, but because computing the gradient of the full expectation is expensive, the gradient of a mini-batch of samples is usually used instead. This substitution makes the gradient estimate inaccurate and high-variance, which hurts model performance.
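
The effect of batch size on gradient noise can be illustrated with a minimal sketch. It does not use a DQN: it assumes a toy linear model with a squared-error loss, and the function name `minibatch_gradient_variance`, the data sizes, and the batch sizes are all hypothetical choices made for illustration. It measures how far mini-batch gradients fall, on average, from the full-data gradient.

```python
import numpy as np


def minibatch_gradient_variance(batch_size, n_trials=2000, seed=0):
    """Mean squared deviation of mini-batch gradients from the full gradient.

    Toy setting (an assumption, not the paper's setup): a linear model
    q = X @ theta with a squared-error loss against fixed targets y.
    """
    rng = np.random.default_rng(seed)
    n, d = 1000, 5
    X = rng.normal(size=(n, d))
    y = X @ rng.normal(size=d) + rng.normal(size=n)
    theta = np.zeros(d)

    def grad(idx):
        # Gradient of the mean squared error over the rows selected by idx.
        residual = X[idx] @ theta - y[idx]
        return 2.0 * X[idx].T @ residual / len(idx)

    full = grad(np.arange(n))  # stands in for the "full expectation" gradient
    sq_dev = [
        np.sum((grad(rng.choice(n, size=batch_size, replace=False)) - full) ** 2)
        for _ in range(n_trials)
    ]
    return float(np.mean(sq_dev))


# Same data for every batch size because the seed is fixed.
variances = {b: minibatch_gradient_variance(b) for b in (8, 32, 128)}
for b, var in variances.items():
    print(f"batch size {b:4d}: mean squared deviation {var:.4f}")
```

In this toy setting the deviation shrinks as the batch grows, which is the trade-off the summary describes: small batches are cheap but give noisy gradient estimates.
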
To describe this problem more precisely, let \(\theta\) denote the network parameters of the DQN. The core learning step minimizes the gap between the estimated maximum Q-value \(y(s, a)\) and the current Q-value \(Q(s, a; \theta)\):

\[ \hat{\theta} = \arg\min_{\theta} E\left[\|y(s, a) - Q(s, a; \theta)\|^2\right] \]

If the variance of the gradient estimate is large, more iterations are needed for \(\theta\) to reach \(\hat{\theta}\). In other words, a large gradient variance delays the DQN's convergence to a locally optimal solution.

### Solution

The authors therefore propose the SVR-DQN optimization method, which accelerates the convergence of Deep Q-learning by reducing the variance of the Approximate Gradient Estimation (AGE). SVR-DQN applies the Stochastic Variance Reduced Gradient (SVRG) technique in the following steps:

1. **Construct training sample batches**: Extract a batch \(B_s\) from all training samples and fix it for the entire optimization process.
2. **Calculate anchor point gradients**: Calculate the average gradient over the samples in \(B_s\) and use it as the anchor point \(\tilde{\mu}_s\).
3. **Inner-loop iterative variance reduction**: Reduce the variance using randomly selected mini-batches \(b_t\) and update the parameters according to the update rule.
4. **Combine with Adam optimizer**: Pass the variance-reduced gradient to the Adam optimizer to further update the parameters.

Through these steps, SVR-DQN reduces the gradient variance while maintaining high sample efficiency and fast convergence.

### Experimental results

The experimental results show that SVR-DQN significantly outperforms the baseline methods in the Atari game environment, performing better on 18 of the 20 games, and shows stronger learning ability in the early stage of training.
In addition, the sample efficiency of SVR-DQN is almost twice that of the original Double-DQN, and it performs more stably across multiple games.

In conclusion, the paper addresses inaccurate gradient estimation and excessive variance in Deep Q-learning by introducing the SVRG technique, improving the model's convergence speed and sample efficiency.
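
The four steps listed under Solution can be sketched on a toy problem. This is a sketch under stated assumptions, not the authors' implementation: it replaces the DQN loss with a linear least-squares loss, uses the standard SVRG estimator (mini-batch gradient at the current parameters, minus the same mini-batch's gradient at a snapshot, plus the anchor gradient), redraws the anchor batch at each outer iteration, draws mini-batches from the anchor batch, and uses a hand-written Adam update. The function name `svrg_adam` and all hyperparameters are hypothetical, and the paper's exact update rule may differ.

```python
import numpy as np


def svrg_adam(X, y, outer_steps=20, inner_steps=50, anchor_size=256,
              batch_size=16, lr=0.05, seed=0):
    """SVRG-style variance-reduced gradients fed into an Adam update (toy sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    m, v, t = np.zeros(d), np.zeros(d), 0  # Adam moment estimates and step count
    beta1, beta2, eps = 0.9, 0.999, 1e-8

    def grad(params, idx):
        # Gradient of the mean squared error over the rows selected by idx.
        residual = X[idx] @ params - y[idx]
        return 2.0 * X[idx].T @ residual / len(idx)

    for _ in range(outer_steps):
        # Steps 1-2: draw the anchor batch B_s, snapshot the parameters,
        # and compute the anchor gradient mu on B_s at the snapshot.
        anchor_idx = rng.choice(n, size=anchor_size, replace=False)
        snapshot = theta.copy()
        mu = grad(snapshot, anchor_idx)

        for _ in range(inner_steps):
            # Step 3: variance-reduced gradient on a mini-batch b_t.
            # Assumption: b_t is drawn from the anchor batch.
            b = rng.choice(anchor_idx, size=batch_size, replace=False)
            g = grad(theta, b) - grad(snapshot, b) + mu

            # Step 4: pass the variance-reduced gradient to Adam.
            t += 1
            m = beta1 * m + (1.0 - beta1) * g
            v = beta2 * v + (1.0 - beta2) * g * g
            m_hat = m / (1.0 - beta1 ** t)
            v_hat = v / (1.0 - beta2 ** t)
            theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta


# Usage on synthetic data (all values are illustrative).
rng = np.random.default_rng(1)
w_true = np.array([1.0, -2.0, 0.5, 3.0, -1.5])
X = rng.normal(size=(2000, 5))
y = X @ w_true + 0.1 * rng.normal(size=2000)

theta = svrg_adam(X, y)
initial_mse = float(np.mean(y ** 2))  # loss at theta = 0
final_mse = float(np.mean((X @ theta - y) ** 2))
print(f"MSE before: {initial_mse:.3f}, after: {final_mse:.3f}")
```

The correction term `grad(theta, b) - grad(snapshot, b)` is small when the current parameters are close to the snapshot, so the estimate stays near the lower-noise anchor gradient `mu`. That is the source of the variance reduction in this sketch.
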