Off-Policy Training for Truncated TD(λ) Boosted Soft Actor-Critic.

Shiyu Huang, Bin Wang, Hang Su, Dong Li, Jianye Hao, Jun Zhu, Ting Chen
DOI: https://doi.org/10.1007/978-3-030-89370-5_4
2021-01-01
Abstract: TD(λ) has become a crucial algorithm in modern reinforcement learning (RL). By introducing the trace decay parameter λ, TD(λ) elegantly unifies Monte Carlo methods (λ = 1) and one-step temporal difference prediction (λ = 0); with an intermediate value of λ, it can learn the optimal value significantly faster than either extreme. However, it is mainly used in tabular or linear function approximation settings, which limits its practicality in large-scale learning and prevents it from being adapted to modern deep RL methods. The main challenge in combining TD(λ) with deep RL methods is the “deadly triad” of function approximation, bootstrapping, and off-policy learning. To address this issue, we explore a new deep multi-step RL method called SAC(λ). Firstly, our method uses a new version of the Soft Actor-Critic algorithm, which stabilizes learning with non-linear function approximation. Secondly, we introduce truncated TD(λ) to reduce the impact of bootstrapping. Thirdly, we use importance sampling for off-policy correction. In addition, the time complexity of the training process can be reduced via parallel updates and parameter sharing. Our experimental results show that SAC(λ) can improve training efficiency and the stability of off-policy learning. Our ablation study also shows the impact of changing the trace decay parameter λ and offers some insights into how to choose an appropriate λ.
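To make the role of λ concrete, the following is a minimal sketch of a generic truncated λ-return over an n-step segment, computed with the standard backward recursion G_k = r_k + γ[(1 − λ) V(s_{k+1}) + λ G_{k+1}]. It is a textbook-style illustration, not the paper's SAC(λ) update: the entropy term of Soft Actor-Critic and the importance-sampling correction are omitted, and the function name, argument names, and example numbers are assumptions introduced here.

```python
from typing import Sequence


def truncated_lambda_return(
    rewards: Sequence[float],
    next_values: Sequence[float],
    gamma: float,
    lam: float,
) -> float:
    """Truncated lambda-return for the first state of an n-step segment.

    rewards[i]     is the reward r_{t+i} for i = 0..n-1.
    next_values[i] is a value estimate V(s_{t+i+1}) (hypothetical critic output).
    """
    if len(rewards) == 0 or len(rewards) != len(next_values):
        raise ValueError("rewards and next_values must be non-empty and equal in length")

    # Beyond the truncation horizon we bootstrap from the last value estimate.
    g = next_values[-1]
    # Walk backwards through the segment, mixing the one-step bootstrap
    # target (weight 1 - lam) with the longer return (weight lam).
    for i in reversed(range(len(rewards))):
        g = rewards[i] + gamma * ((1.0 - lam) * next_values[i] + lam * g)
    return g


if __name__ == "__main__":
    rewards = [1.0, 1.0, 1.0]
    next_values = [0.5, 0.5, 2.0]
    for lam in (0.0, 0.5, 1.0):
        print(lam, truncated_lambda_return(rewards, next_values, gamma=0.9, lam=lam))
```

In this sketch, λ = 0 reduces to the one-step TD target and λ = 1 reduces to the n-step return bootstrapped at the end of the segment, which is the truncated counterpart of the Monte Carlo extreme mentioned in the abstract.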