Deep Q-learning Sampling Based on Advantages

Ming Xie, Xinrui Ren, Jianbo Yu, Feng Shu
DOI: https://doi.org/10.1109/irce55557.2022.9963132
2022-01-01
Abstract: Deep Q-learning (DQN) has recently succeeded on a wide range of complex sequential decision-making problems, especially in classic control. However, in most DQN training, the sampling policies, particularly the $\epsilon$-greedy policy and the rules derived from it, introduce noise because the Q values used to search for optimal actions are unstable and unreliable. We begin with a fundamental hypothesis: exploitation based entirely on advantages can support successful exploration, and the gap between the advantages of good and poor actions tends to widen during the early phases of training before stabilizing. We then present a novel DQN architecture with an improved sampling policy that is based entirely and directly on advantage values. We name this method A-sampling DQN; it requires only minor changes to the original DQN code. In most cases, A-sampling DQN and its derivatives outperformed the traditional DQN method.
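The abstract does not spell out the sampling rule, so the following is only a minimal sketch of what sampling from advantage values could look like, not the authors' implementation. It assumes advantages are computed as Q values minus their mean over actions and that actions are drawn from a softmax over those advantages. The function names (`advantages`, `advantage_sampling_probs`, `sample_action`) and the `temperature` parameter are hypothetical and introduced only for illustration.

```python
import math
import random
from typing import List, Optional


def advantages(q_values: List[float]) -> List[float]:
    """Assumed definition: A(s, a) = Q(s, a) - mean over actions of Q(s, .)."""
    baseline = sum(q_values) / len(q_values)
    return [q - baseline for q in q_values]


def advantage_sampling_probs(q_values: List[float],
                             temperature: float = 1.0) -> List[float]:
    """Softmax over advantages (an assumed rule, not taken from the paper)."""
    scaled = [a / temperature for a in advantages(q_values)]
    top = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(q_values: List[float],
                  temperature: float = 1.0,
                  rng: Optional[random.Random] = None) -> int:
    """Draw one action index from the advantage-based distribution."""
    rng = rng or random.Random()
    probs = advantage_sampling_probs(q_values, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    early = [1.0, 1.1, 0.9]   # small advantage gap: sampling stays exploratory
    later = [1.0, 5.0, 0.0]   # large advantage gap: sampling is nearly greedy
    print(advantage_sampling_probs(early))
    print(advantage_sampling_probs(later))
    print(sample_action(later, rng=random.Random(0)))
```

Under these assumptions, a small gap between advantages gives a nearly uniform distribution, and a widening gap concentrates probability on the best action, which matches the hypothesis stated above. Note that with a fixed mean baseline a softmax over advantages coincides with a softmax over the Q values themselves, so the sketch illustrates only the role of the advantage gap; it does not capture whatever architectural changes the paper makes.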