Implicit Posterior Sampling Reinforcement Learning for Continuous Control

Shaochen Wang, Bin Li
DOI: https://doi.org/10.1007/978-3-030-63833-7_38
2020-01-01
Abstract: Value function approximation has achieved notable success in reinforcement learning. Many popular algorithms (e.g., Deep Q-Network) maintain a point estimate of the parameters of the value network or policy network. However, this frequentist perspective is prone to overfitting and lacks a representation of uncertainty. In this paper, we perform a Bayesian analysis of the value function. Following the principle of "optimism in the face of uncertainty", we perform posterior sampling of the value or policy network, implicitly capturing the posterior distribution via a Bayesian hypernetwork. Experimental results show that the implicit posterior distribution, which models the structural dependencies between parameters, better balances exploration and exploitation, and that the method is competitive with state-of-the-art methods on the MuJoCo continuous control benchmark.
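The idea of sampling a value network from an implicit posterior can be illustrated with a minimal sketch. This is not the authors' implementation: the single linear hypernetwork, the layer sizes, the untrained random weights, and the names `sample_q_params`, `q_value`, and `act` are all assumptions made for illustration. A hypernetwork maps a noise vector to the full parameter vector of a small Q-network, so each noise draw yields one plausible value function, and the agent acts greedily with respect to that draw.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not taken from the paper).
STATE_DIM, ACTION_DIM, HIDDEN, NOISE_DIM = 3, 1, 8, 4
IN_DIM = STATE_DIM + ACTION_DIM
# Parameter count of the target Q-network: W1, b1, w2, b2.
N_PARAMS = IN_DIM * HIDDEN + HIDDEN + HIDDEN + 1

# Hypothetical hypernetwork: one linear map from noise to Q-network weights.
# It is left untrained here; in practice it would be fitted to TD targets.
H_W = rng.normal(scale=0.3, size=(NOISE_DIM, N_PARAMS))
H_b = rng.normal(scale=0.1, size=N_PARAMS)


def sample_q_params(rng):
    """Draw z ~ N(0, I) and map it to one set of Q-network parameters."""
    z = rng.standard_normal(NOISE_DIM)
    theta = z @ H_W + H_b
    i = 0
    W1 = theta[i:i + IN_DIM * HIDDEN].reshape(IN_DIM, HIDDEN)
    i += IN_DIM * HIDDEN
    b1 = theta[i:i + HIDDEN]
    i += HIDDEN
    w2 = theta[i:i + HIDDEN]
    i += HIDDEN
    b2 = theta[i]
    return W1, b1, w2, b2


def q_value(params, state, action):
    """Evaluate the sampled one-hidden-layer Q-network at (state, action)."""
    W1, b1, w2, b2 = params
    x = np.concatenate([state, action])
    h = np.tanh(x @ W1 + b1)
    return float(h @ w2 + b2)


def act(params, state, candidates):
    """Pick the candidate action with the highest sampled Q-value."""
    scores = [q_value(params, state, a) for a in candidates]
    return candidates[int(np.argmax(scores))]


# One posterior sample is held fixed while acting, then resampled later.
params = sample_q_params(rng)
state = np.zeros(STATE_DIM)
candidates = np.linspace(-1.0, 1.0, 5).reshape(-1, ACTION_DIM)
action = act(params, state, candidates)
```

Because all parameters are generated jointly from the same noise vector, the samples can carry dependencies between weights, which a factorized per-weight distribution could not express. Maximizing over a fixed grid of candidate actions is a simplification chosen here for brevity; a continuous-control agent would normally use a learned policy instead.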