Soft Q Network

Jingbin Liu, Shuai Liu, Xinyang Gu
DOI: https://doi.org/10.48550/arXiv.1912.10891
2020-12-14
Abstract: Deep Q Network (DQN) is a very successful algorithm, yet the inherent problem of reinforcement learning, i.e. the exploit-explore balance, remains. In this work, we introduce entropy regularization into DQN and propose SQN. We find that the backup equation of soft Q learning can enjoy corrective feedback if we view the soft backup as policy improvement in the form of Q, instead of policy evaluation. We show that Soft Q Learning with Corrective Feedback (SQL-CF) underlies the on-policy nature of SQL and the equivalence of SQL and Soft Policy Gradient (SPG). With these insights, we propose an on-policy version of the deep Q learning algorithm, i.e. Q On-Policy (QOP). We experiment with QOP on a self-play environment called Google Research Football (GRF). The QOP algorithm exhibits great stability and efficiency in training GRF agents.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the exploration-exploitation balance problem in the Deep Q-Network (DQN), together with the problems that follow from it: sub-optimal solutions, low sample efficiency, and hyper-parameter sensitivity. Specifically:

1. **Sub-optimal solutions**: The DQN algorithm design requires a decaying exploration factor that must be added and tuned manually, which usually leads to good short-term performance but poor long-term performance.
2. **Low sample efficiency**: DQN, like other reinforcement learning methods, often requires a large number of interaction steps to solve a problem. For example, in the Atari57 environment, billions of interaction steps are required.
3. **Hyper-parameter sensitivity**: DQN is very sensitive to hyper-parameters. Tuning them is time-consuming and hurts the reproducibility of results.

To solve these problems, the authors introduce entropy regularization into DQN and propose the Soft Q-Network (SQN). Entropy regularization encourages exploration by adding an entropy reward, so the hard-to-tune exploration factor is no longer needed; exploration is instead controlled by an adaptive temperature parameter α.

The paper also addresses the "corrective feedback problem". In the standard SQL algorithm, the target value is usually calculated by applying the Bellman backup to the previous Q-function, which may produce uncorrected Q-value targets and thus harm training efficiency. By modifying soft Q-learning, the authors propose soft Q-learning with corrective feedback (SQL-CF), which ensures that each Bellman backup step is a policy improvement and has corrective feedback.

Finally, building on these results, the authors propose a distributed version of SQN-CF, namely the Q On-Policy (QOP) algorithm. QOP is an on-policy algorithm based on n-step backup.
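The way a temperature α can replace a hand-tuned exploration factor may be easier to see in code. The following is a minimal sketch, not the paper's implementation: it assumes a discrete action space and uses the standard closed-form maximum-entropy results, where the soft value is a scaled log-sum-exp of the Q-values and the policy is a softmax over Q/α. The function names (`soft_value`, `boltzmann_policy`, `soft_backup_target`) are hypothetical and chosen only for illustration.

```python
import math


def soft_value(q_values, alpha):
    """Soft state value V(s) = alpha * log(sum_a exp(Q(s, a) / alpha))."""
    m = max(q_values)  # subtract the max for numerical stability
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_values))


def boltzmann_policy(q_values, alpha):
    """Policy pi(a | s) proportional to exp(Q(s, a) / alpha)."""
    v = soft_value(q_values, alpha)
    return [math.exp((q - v) / alpha) for q in q_values]


def entropy(probs):
    """Shannon entropy H(pi) = -sum_a pi(a) * log(pi(a))."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def soft_backup_target(reward, next_q_values, alpha, gamma, done=False):
    """One-step soft Bellman target r + gamma * V(s') (assumed one-step form)."""
    if done:
        return reward
    return reward + gamma * soft_value(next_q_values, alpha)


if __name__ == "__main__":
    q = [1.0, 2.0, 0.5]  # illustrative Q-values for three actions
    for alpha in (1.0, 0.1, 0.01):
        pi = boltzmann_policy(q, alpha)
        print(alpha, [round(p, 4) for p in pi], round(soft_value(q, alpha), 4))
```

A large α spreads probability across actions (more exploration), while a small α concentrates it on the highest-valued action, approaching the greedy DQN choice. Under these assumptions the soft value also equals the expected Q-value under the policy plus α times the policy entropy, which is the entropy reward described above.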
Experiments show that QOP exhibits great stability and high efficiency when training Google Research Football (GRF) agents.

### Summary of key formulas

1. **Objective function of maximum-entropy reinforcement learning**: \[ \pi^*=\arg\max_{\pi} \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t = 0}^{\infty} \gamma^t \left( r(s_t, a_t)+\alpha H(\pi(\cdot | s_{t + 1})) \right) \right] \]
2. **Soft Q-value function**: \[ Q^{\pi}(s, a)=\mathbb{E}_{\tau \sim \pi} \left[ \sum_{t = 0}^{\infty} \gamma^t \left( r(s_t, a_t)+\alpha H(\pi(\cdot | s_{t + 1})) \right)\mid s_0 = s, a_0 = a \right] \]
3. **Soft state-value function**: \[ V^{\pi}(s)=\mathbb{E}_{a \sim \pi} \left[ Q^{\pi}(s, a)+\alpha H(\pi(\cdot | s)) \right] \]
4. **Soft Bellman backup**: \[ Q^{\pi}(s, a)=r(s, a)+\gamma \mathbb{E}_{s' \sim P} \left[ V^{\pi}(s') \right] \]
5. **Soft policy improvement**: \[ \pi_{\text{new}}(\cdot | s_t)=\exp \left( \frac{Q^{\pi_{\text{old}}}(s_t