Abstract: Reinforcement learning has achieved tremendous success in many Atari games. In this paper, we explore the Lunar Lander environment and implement classical methods including Q-Learning, SARSA, and Monte Carlo (MC), as well as tile coding. We also implement neural-network-based methods including DQN, Double DQN, and Clipped DQN. On top of these, we propose a new algorithm called Heuristic RL, which uses heuristics to guide early-stage training while alleviating the human bias this introduces. Our experiments show promising results for the proposed methods in the Lunar Lander environment.
What problem does this paper attempt to address?
This paper addresses the low learning efficiency of reinforcement learning (RL) algorithms in the early stage of training, particularly in the Lunar Lander environment. Specifically:
1. **Challenges in Early Training**: In the early training stage, the agent knows very little about the environment and can only explore the action space randomly. This makes it very difficult for the agent to find effective strategies, especially in a sparsely rewarded environment like Lunar Lander. The agent obtains a positive reward only when it lands successfully, which is very rare in the early stage.
2. **The Necessity of Introducing Heuristic Functions**: To help the agent find feasible solutions more quickly in the early training stage, the authors propose using heuristic functions to guide training. These heuristic functions help the agent explore the state space more effectively, thus accelerating the learning process.
3. **Avoiding Human-Made Bias**: Although heuristic functions can accelerate learning, excessive reliance on them may introduce human-made bias and lead to locally optimal solutions. Therefore, the authors propose a "vanishing bias" method, which uses heuristic functions in the early training stage and gradually reduces their influence as training progresses, so that the agent ultimately relies on data-driven learning.
4. **Improved Deep Reinforcement Learning Algorithms**: The authors implement not only classical reinforcement learning algorithms (such as Q-Learning, SARSA, and Monte Carlo) but also neural-network-based deep reinforcement learning algorithms (such as DQN, Double DQN, and Clipped DQN). On this basis, they propose a heuristic-guided deep reinforcement learning algorithm (Heuristic DQN) and demonstrate the effectiveness of these methods in experiments.
In summary, the main goal of this paper is to improve the learning efficiency of reinforcement learning algorithms in the Lunar Lander environment, especially the performance in the early training stage, by introducing heuristic functions and "vanishing bias" techniques. The experimental results show that this method can significantly improve the success rate and average score of the agent.
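As a rough illustration of the "vanishing bias" idea described above, the following Python sketch subtracts a heuristic penalty from the reward when forming a one-step Q-learning target and shrinks the penalty weight as training progresses. The exponential schedule, the function names `heuristic_weight` and `shaped_target`, and all numeric constants are assumptions chosen for illustration; they are not taken from the paper.

```python
import math


def heuristic_weight(episode: int, initial: float = 1.0, decay: float = 0.01) -> float:
    """Hypothetical schedule: the heuristic weight decays exponentially with the episode index."""
    return initial * math.exp(-decay * episode)


def shaped_target(reward, heuristic_cost, next_q_values, episode, gamma=0.99, done=False):
    """One-step Q-learning target with a heuristic penalty whose weight vanishes over time.

    heuristic_cost is assumed to be a non-negative penalty computed elsewhere,
    for example from the lander's distance to the landing pad.
    """
    weight = heuristic_weight(episode)
    # No bootstrapping from the next state once the episode has terminated.
    bootstrap = 0.0 if done else gamma * max(next_q_values)
    return reward - weight * heuristic_cost + bootstrap


if __name__ == "__main__":
    # The same transition is penalized less and less as the episode index grows.
    for episode in (0, 100, 1000):
        target = shaped_target(
            reward=-1.0, heuristic_cost=2.0, next_q_values=[0.5, 1.0], episode=episode
        )
        print(episode, round(target, 4))
```

Early in training the heuristic term dominates the target and steers the agent toward states the heuristic favors; once the weight is close to zero, the target reduces to the ordinary data-driven Q-learning target.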
### Formula Summary
1. **Heuristic Function**:
\[
h(s_t, s_{t + 1})=\begin{cases}
k_1\cdot\phi(s_t, s_{t + 1})&\text{if }s_{t + 1}\in B_{\epsilon_1}^t\\
k_2\cdot\phi(s_t, s_{t + 1})&\text{otherwise}
\end{cases}
\]
where
\[
\phi(a, b)=\alpha\phi_1\left(\begin{bmatrix}a_x\\a_y\end{bmatrix},\begin{bmatrix}b_x\\b_y\end{bmatrix}\right)+\beta\phi_2\left(\begin{bmatrix}a_{\theta x}\\a_{\theta y}\end{bmatrix},\begin{bmatrix}b_{\theta x}\\b_{\theta y}\end{bmatrix}\right)
\]
\[
\phi_1\left(\begin{bmatrix}a_x\\a_y\end{bmatrix},\begin{bmatrix}b_x\\b_y\end{bmatrix}\right)=b_x^2 + b_y^2
\]
\(\phi_2\) represents the change in angle with respect to the vertical axis.
2. **Calculation of Target Q-value**:
\[
\hat{Q}(x_t, a_t)=r_t-\alpha_t h(x_t, x_{t + 1})+\gamma\max_{a'}Q(x_{t + 1}, a')
\]