Demonstration-Based Proximal Policy Optimization with Action Guidance

Liu Zixuan, Zhang Qiyuan, Yang Jun, Zeng Kailin, Chen Bin
DOI: https://doi.org/10.1007/978-981-19-3927-3_84
2022-01-01
Abstract: Exploration remains a major challenge in Reinforcement Learning (RL), especially in environments with sparse reward signals. Recent work has shown that learning methods that use expert demonstrations and enhanced exploration signals from the environment can effectively overcome this exploration difficulty. However, they require large amounts of high-quality expert data. To address this, this paper develops a novel Proximal Policy Optimization from Demonstration (PPOfD) method that combines the demonstrations with the agent's subsequent successful trajectories to provide action guidance. An adaptive parameter is then designed to integrate this idea with proximal policy optimization more efficiently. In addition, we illustrate how PPOfD guides the formation of implicit rewards that bring demonstrable benefits for policy improvement. Finally, the performance of the proposed method is evaluated on a series of popular sparse-reward tasks. Experimental results show that the proposed PPOfD surpasses three state-of-the-art baselines even when the demonstrations are few and imperfect.