Abstract: Deep Q Network (DQN) is a very successful algorithm, yet the inherent problem of reinforcement learning, i.e. the exploit-explore balance, remains. In this work, we introduce entropy regularization into DQN and propose SQN. We find that the backup equation of soft Q learning can enjoy the corrective feedback if we view the soft backup as policy improvement in the form of Q, instead of policy evaluation. We show that Soft Q Learning with Corrective Feedback (SQL-CF) underlies the on-policy nature of SQL and the equivalence of SQL and Soft Policy Gradient (SPG). With these insights, we propose an on-policy version of the deep Q learning algorithm, i.e. Q On-Policy (QOP). We experiment with QOP on a self-play environment called Google Research Football (GRF). The QOP algorithm exhibits great stability and efficiency in training GRF agents.

What problem does this paper attempt to address?

The paper addresses the exploration-exploitation balance problem in the Deep Q-Network (DQN), together with the problems that follow from it: sub-optimal solutions, low sample efficiency, and hyper-parameter sensitivity. Specifically:

1. **Sub-optimal solutions**: DQN's design requires a decaying exploration factor that must be added and tuned manually, which usually leads to good short-term performance but poor long-term performance.
2. **Low sample efficiency**: DQN, like other reinforcement learning methods, often requires a large number of interaction steps to solve problems. For example, in the Atari57 environment, billions of interaction steps are required.
3. **Hyper-parameter sensitivity**: DQN is very sensitive to hyper-parameters. Tuning them is time-consuming and hurts the reproducibility of results.

To solve these problems, the author introduces entropy regularization into DQN and proposes the Soft Q-Network (SQN). Entropy regularization encourages exploration by adding an entropy reward, so the hard-to-tune exploration factor is no longer needed; exploration is instead controlled by an adaptive temperature parameter α.

The paper also addresses the "corrective feedback problem". In the standard SQL algorithm, the target value is usually calculated by applying a Bellman backup to the previous Q-function, which may produce uncorrected Q-value targets and thus harm training efficiency. By modifying soft Q-learning, the author proposes soft Q-learning with corrective feedback (SQL-CF), which ensures that each Bellman backup step is a policy improvement and has corrective feedback.

Finally, building on the above, the author proposes a distributed version of SQN-CF, namely the Q On-Policy (QOP) algorithm. QOP is an on-policy algorithm based on n-step backup.
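
As a rough illustration of how a temperature can take over the role of a hand-tuned exploration factor, the sketch below samples-free computes a Boltzmann policy proportional to exp(Q/α) from a list of Q-values. The function names, the example Q-values, and the fixed values of `alpha` are assumptions made for this sketch; the paper's α is adaptive, which is not modeled here.

```python
import math

def boltzmann_policy(q_values, alpha):
    """Action probabilities proportional to exp(Q / alpha) (hypothetical helper)."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    q_max = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - q_max) / alpha) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

def entropy(probs):
    """Shannon entropy H(pi) in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Illustrative Q-values for three actions (made up for this sketch).
q = [1.0, 2.0, 0.5]

# Small alpha: nearly greedy. Large alpha: nearly uniform, i.e. more exploration.
greedy_like = boltzmann_policy(q, alpha=0.05)
exploratory = boltzmann_policy(q, alpha=5.0)

print("alpha=0.05:", greedy_like, "entropy:", entropy(greedy_like))
print("alpha=5.0 :", exploratory, "entropy:", entropy(exploratory))
```

With a small α the policy concentrates on the highest-valued action, and with a large α it spreads probability across all actions, so a single parameter interpolates between exploitation and exploration.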
Experiments show that QOP exhibits great stability and high efficiency when training Google Research Football (GRF) agents.

### Summary of key formulas

1. **Objective function of maximum-entropy reinforcement learning**:
\[ \pi^*=\arg\max_{\pi} \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t = 0}^{\infty} \gamma^t \left( r(s_t, a_t)+\alpha H(\pi(\cdot | s_t)) \right) \right] \]
2. **Soft Q-value function**:
\[ Q^{\pi}(s, a)=\mathbb{E}_{\tau \sim \pi} \left[ \sum_{t = 0}^{\infty} \gamma^t r(s_t, a_t)+\alpha \sum_{t = 1}^{\infty} \gamma^t H(\pi(\cdot | s_t)) \mid s_0 = s, a_0 = a \right] \]
3. **Soft state-value function**:
\[ V^{\pi}(s)=\mathbb{E}_{a \sim \pi} \left[ Q^{\pi}(s, a) \right]+\alpha H(\pi(\cdot | s)) \]
4. **Soft Bellman backup**:
\[ Q^{\pi}(s, a)=r(s, a)+\gamma \mathbb{E}_{s' \sim P} \left[ V^{\pi}(s') \right] \]
5. **Soft policy improvement**: \[ \pi_{\text{new}}(\cdot | s_t)=\exp \left( \frac{Q^{\pi_{\text{old}}}(s_t
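
The soft state-value and soft Bellman backup formulas (3 and 4) can be checked numerically with a small sketch. It assumes a discrete action space and a Boltzmann policy proportional to exp(Q/α), the usual soft Q-learning form, under which formula 3 reduces to the closed form α·log Σ exp(Q/α). All function names, the `done` flag, and the numbers are assumptions of this sketch, not the paper's implementation.

```python
import math

def softmax_policy(q_values, alpha):
    """Boltzmann policy proportional to exp(Q / alpha) (assumed policy form)."""
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / alpha) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

def soft_value_from_policy(q_values, probs, alpha):
    """Formula 3: V(s) = E_{a~pi}[Q(s, a)] + alpha * H(pi(.|s))."""
    expected_q = sum(p * q for p, q in zip(probs, q_values))
    ent = -sum(p * math.log(p) for p in probs if p > 0.0)
    return expected_q + alpha * ent

def soft_value_closed_form(q_values, alpha):
    """alpha * log sum_a exp(Q(s, a) / alpha), computed stably."""
    q_max = max(q_values)
    return q_max + alpha * math.log(
        sum(math.exp((q - q_max) / alpha) for q in q_values)
    )

def soft_backup_target(reward, next_q_values, alpha, gamma, done=False):
    """Formula 4 for one sampled transition: r + gamma * V(s')."""
    if done:  # terminal handling is an assumption of this sketch
        return reward
    return reward + gamma * soft_value_closed_form(next_q_values, alpha)

# Illustrative numbers (made up for this sketch).
next_q = [1.0, 2.0, 0.5]
alpha, gamma = 0.5, 0.99

pi = softmax_policy(next_q, alpha)
v_def = soft_value_from_policy(next_q, pi, alpha)
v_lse = soft_value_closed_form(next_q, alpha)
target = soft_backup_target(reward=1.0, next_q_values=next_q, alpha=alpha, gamma=gamma)

print("V from formula 3:", v_def)
print("V closed form   :", v_lse)
print("backup target   :", target)
```

The two value computations agree for the Boltzmann policy, and as α shrinks the soft value approaches max Q, which recovers the hard max used in the standard DQN target.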
