Episode-Experience Replay Based Tree-Backup Method for Off-Policy Actor-Critic Algorithm.

Haobo Jiang, Jianjun Qian, Jin Xie, Jian Yang
DOI: https://doi.org/10.1007/978-3-030-03398-9_48
2018-01-01
Abstract: Off-policy algorithms play an important role in deep reinforcement learning. Because the off-policy policy gradient is a biased estimate, previous work has used importance sampling to obtain an unbiased estimate, which requires the behavior policy to be known in advance. However, it is difficult to choose a reasonable behavior policy for complex agents. Moreover, importance sampling usually produces large variance. To address these problems, this paper presents a novel actor-critic policy gradient algorithm. Specifically, we employ the tree-backup method in the off-policy setting to obtain an unbiased estimate of the target policy gradient without using importance sampling. Meanwhile, we combine naive episode-experience replay with experience replay to obtain trajectory samples and reduce the strong correlations between these samples. The experimental results demonstrate the advantages of the proposed method over competing methods.
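To make the role of tree-backup concrete, the following is a minimal sketch of the textbook n-step tree-backup return computed over one stored episode. It is not taken from the paper and may differ from the authors' formulation. The function name `tree_backup_targets` and its argument layout are assumptions made for illustration. Note that only target-policy probabilities and critic values appear, so no behavior-policy probabilities (and hence no importance ratios) are needed.

```python
def tree_backup_targets(rewards, next_pi, next_q, next_actions,
                        gamma=0.99, terminal=True):
    """Backward recursion for tree-backup returns over one episode.

    Hypothetical argument layout (an assumption, not the paper's API):
      rewards[t]      reward received after taking A_t in S_t
      next_pi[t][a]   target-policy probability pi(a | S_{t+1})
      next_q[t][a]    critic estimate Q(S_{t+1}, a)
      next_actions[t] action A_{t+1} actually taken in S_{t+1}
                      (ignored at the last step)
    """
    T = len(rewards)
    targets = [0.0] * T
    g_next = 0.0
    for t in reversed(range(T)):
        if t == T - 1:
            if terminal:
                # Episode ended: nothing to bootstrap from.
                g = rewards[t]
            else:
                # Truncated trajectory: bootstrap with the expected value
                # of the next state under the target policy.
                expected = sum(p * q for p, q in zip(next_pi[t], next_q[t]))
                g = rewards[t] + gamma * expected
        else:
            a_next = next_actions[t]
            # Actions not taken contribute their critic estimates ...
            expected_others = sum(
                p * q
                for a, (p, q) in enumerate(zip(next_pi[t], next_q[t]))
                if a != a_next
            )
            # ... while the taken action contributes the sampled return,
            # weighted by the target policy's probability of that action.
            g = rewards[t] + gamma * (
                expected_others + next_pi[t][a_next] * g_next
            )
        targets[t] = g
        g_next = g
    return targets


# Two-step terminal episode with two actions (illustrative numbers only).
example = tree_backup_targets(
    rewards=[1.0, 2.0],
    next_pi=[[0.5, 0.5], [0.5, 0.5]],
    next_q=[[4.0, 0.0], [0.0, 0.0]],
    next_actions=[1, 0],
    gamma=0.5,
    terminal=True,
)
```

The recursion runs backward over a whole trajectory, which is one reason an episode-level replay buffer is a natural fit: each sampled episode supplies the consecutive transitions the return needs.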