Gradient compensation traces based temporal difference learning
Wang Bi, Li Xuelian, Gao Zhiqiang, Chen Yang
DOI: https://doi.org/10.1016/j.neucom.2021.02.042
IF: 6
2021-06-01
Neurocomputing
Abstract:<p>For online updates and data efficiency, eligibility traces are used to transform forward-view algorithms into backward-view ones, such as temporal difference learning (TD) and its control versions. Existing research on eligibility traces, such as TD(<span class="math"><math>λ</math></span>) and true-online TD(<span class="math"><math>λ</math></span>), mainly focuses on the equivalence between forward views and backward views. However, the choice of <span class="math"><math>λ</math></span> determines the time scope of credit assignment, and a small <span class="math"><math>λ</math></span> accelerates the decay of credit over time. This paper proposes a different implementation of the backward view, named gradient compensation traces (GCT). GCT compensates online for the difference between a bootstrapped gradient estimate and the true gradient, removing the extra decay of credit. Based on GCT, the corresponding temporal difference learning algorithm (gradient compensation TD, GCTD) is proved to converge under certain conditions. The sensitivity of GCTD to its hyper-parameter is analyzed in the nonlinear long-corridor and linear random-walk tasks. The proposed algorithm is comparable with true-online TD(<span class="math"><math>λ</math></span>) in the <em>basic</em> Mountain Car task, and outperforms the baselines in the <em>sparse-reward</em> setting.</p>
computer science, artificial intelligence
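To make the abstract's point about λ concrete, the following is a minimal sketch of the standard backward-view TD(λ) baseline with accumulating traces, not of GCT or GCTD, whose update rules the abstract does not give. The function names, the tabular five-state random walk, the reward scheme, and the step size are all illustrative assumptions.

```python
import random


def trace_weights(lam, gamma, steps):
    """Credit an accumulating trace keeps for a state visited k steps ago.

    Each step multiplies the trace by gamma * lam, so the remaining weight
    after k steps is (gamma * lam) ** k. A small lam shrinks it quickly.
    """
    return [(gamma * lam) ** k for k in range(steps)]


def td_lambda_random_walk(lam, alpha=0.1, gamma=1.0, episodes=100,
                          n_states=5, seed=0):
    """Tabular backward-view TD(lambda) on a hypothetical random walk.

    Assumed setup: start in the middle state, move left or right with equal
    probability, reward 1 on exiting to the right and 0 otherwise.
    """
    rng = random.Random(seed)
    v = [0.0] * n_states
    for _ in range(episodes):
        e = [0.0] * n_states          # eligibility trace, reset per episode
        s = n_states // 2
        while True:
            s_next = s + rng.choice((-1, 1))
            done = s_next < 0 or s_next >= n_states
            reward = 1.0 if s_next >= n_states else 0.0
            v_next = 0.0 if done else v[s_next]
            delta = reward + gamma * v_next - v[s]   # one-step TD error
            e = [gamma * lam * x for x in e]         # decay all traces
            e[s] += 1.0                              # mark the visited state
            v = [vi + alpha * delta * ei for vi, ei in zip(v, e)]
            if done:
                break
            s = s_next
    return v


if __name__ == "__main__":
    print(trace_weights(0.5, 1.0, 4))   # [1.0, 0.5, 0.25, 0.125]
    print(td_lambda_random_walk(lam=0.5))
```

With `gamma = 1`, the only decay of credit in this sketch comes from `lam`, which is the extra decay the abstract says GCT is designed to remove.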