Demonstration actor critic
Guoqing Liu, Li Zhao, Pushi Zhang, Jiang Bian, Tao Qin, Nenghai Yu, Tie-Yan Liu
DOI: https://doi.org/10.1016/j.neucom.2020.12.116
IF: 6
2021-04-01
Neurocomputing
Abstract: We study the problem of Reinforcement Learning from Demonstrations (RLfD), where the agent has access not only to reward signals from the environment but also to expert demonstrations. Recent works borrow ingredients from imitation learning and use demonstration data for reward reshaping. Despite their success, these methods update the policy on states seen in the demonstration data in the same way as on other states in the state space, overlooking the validity of the direct supervision signals on those states. To address this issue, we propose a novel RLfD objective function with a new shaping reward; optimizing it directly leverages the supervision signal on the demonstrated states. We propose a general framework for policy optimization of the proposed objective, with convergence guarantees in the classic tabular setting. Building on that, we make further approximations using deep neural networks and introduce a new practical algorithm, called Demonstration Actor Critic (DAC), for large continuous domains. Extensive experiments on a range of popular sparse-reward benchmark tasks show that our method can lead to significant performance gains over several strong, off-the-shelf baselines.
computer science, artificial intelligence