Efficient Data Use in Incremental Actor–critic Algorithms

Yuhu Cheng,Huanting Feng,Xuesong Wang
DOI: https://doi.org/10.1016/j.neucom.2011.11.034
IF: 6
2013-01-01
Neurocomputing
Abstract:Actor–critic (AC) reinforcement learning methods are on-line approximations to policy iterations and have wide application in solving large-scale Markov decision and high-dimensional learning control problems. In order to overcome data inefficiency of incremental AC algorithms based on temporal difference (AC-TD), two new incremental AC algorithms (i.e., AC-RLSTD and AC-iLSTD) are proposed by applying a recursive least-squares TD (RLSTD(λ)) algorithm and an incremental least-squares TD (iLSTD(λ)) algorithm to the Critic evaluation, which can make more efficient use of data than TD. The Critic estimates a value-function using the RLSTD(λ) or iLSTD(λ) algorithm and the Actor updates the policy based on a regular gradient obtained by the TD error. The improvement in learning evaluation efficiency of the Critic will contribute to the improvement in policy learning performance of the Actor. Simulation results on the learning control of an inverted pendulum and a mountain-car problem illustrate the effectiveness of the two proposed AC algorithms in comparison to the AC-TD algorithm. In addition the AC-iLSTD, using a greedy selection mechanism, can perform much better than the AC-iLSTD using a random selection mechanism. In the simulation, the effect of different parameter settings of the eligibility trace on the learning performance of AC algorithms is analyzed. Furthermore, it is found that different initial values of the variance matrix in the AC-RLSTD algorithm should be chosen appropriately to obtain better performance for different learning problems.
What problem does this paper attempt to address?