Imitating Graph-Based Planning with Goal-Conditioned Policies

Junsu Kim, Younggyo Seo, Sungsoo Ahn, Kyunghwan Son, Jinwoo Shin
2023-03-20
Abstract: Recently, graph-based planning algorithms have gained much attention to solve goal-conditioned reinforcement learning (RL) tasks: they provide a sequence of subgoals to reach the target-goal, and the agents learn to execute subgoal-conditioned policies. However, the sample-efficiency of such RL schemes still remains a challenge, particularly for long-horizon tasks. To address this issue, we present a simple yet effective self-imitation scheme which distills a subgoal-conditioned policy into the target-goal-conditioned policy. Our intuition here is that to reach a target-goal, an agent should pass through a subgoal, so target-goal- and subgoal-conditioned policies should be similar to each other. We also propose a novel scheme of stochastically skipping executed subgoals in a planned path, which further improves performance. Unlike prior methods that only utilize graph-based planning in an execution phase, our method transfers knowledge from a planner along with a graph into policy learning. We empirically show that our method can significantly boost the sample-efficiency of the existing goal-conditioned RL methods under various long-horizon control tasks.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The paper aims to improve the sample efficiency of Goal-Conditioned Reinforcement Learning (GCRL) methods on long-horizon tasks. The authors point out that although graph-based planning algorithms perform well on goal-conditioned RL tasks, existing methods remain sample-inefficient, especially when the horizon is long. To address this, they propose a self-imitation learning framework, Planning-guided self-Imitation learning for Goal-conditioned policies (PIG), which improves existing GCRL methods through two main contributions:

1. **Self-Imitation Learning during Training**:
   - A new training objective encourages the goal-conditioned policy to imitate the sub-goal-conditioned policy. The intuition is that to reach the final goal the agent must pass through a sub-goal, so the goal-conditioned policy and the sub-goal-conditioned policy should be similar.
   - Specifically, the authors design a loss term \( L_{PIG} \) that distills the planned sub-goal-conditioned policy into the goal-conditioned policy.
2. **Sub-goal Skipping during Execution**:
   - The authors propose randomly skipping sub-goals in the planned path to further improve sample efficiency.
   - During the execution phase, the agent can randomly "skip" some sub-goals, which helps most once the learned policy is strong enough. Skipping can help the agent find a better path to the goal.

With these improvements, the authors aim to significantly enhance the sample efficiency of existing GCRL methods across various long-horizon control tasks. Experimental results show that PIG significantly improves the performance of existing methods such as MSS in multiple complex environments, with particularly strong results on tasks such as the large U-shaped AntMaze.
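The two contributions described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the summary only names the loss \( L_{PIG} \), so the squared-error form, the deterministic policy signature, and the names `pig_loss`, `skip_subgoals`, and `skip_prob` are all assumptions made for illustration.

```python
import random
from typing import Callable, List, Sequence

Vector = Sequence[float]
# Hypothetical deterministic policy: (state, goal) -> action.
Policy = Callable[[Vector, Vector], List[float]]


def pig_loss(policy: Policy,
             states: Sequence[Vector],
             goals: Sequence[Vector],
             subgoals: Sequence[Vector]) -> float:
    """Assumed form of the self-imitation loss: the mean squared distance
    between the target-goal-conditioned action and the subgoal-conditioned
    action, where the latter serves as a fixed imitation target."""
    if not states:
        raise ValueError("need at least one sample")
    total = 0.0
    for s, g, sg in zip(states, goals, subgoals):
        target = policy(s, sg)  # in a real trainer, no gradient flows here
        pred = policy(s, g)
        total += sum((p - t) ** 2 for p, t in zip(pred, target))
    return total / len(states)


def skip_subgoals(path: Sequence[Vector],
                  skip_prob: float,
                  rng: random.Random) -> int:
    """Assumed skipping rule: starting at the first planned subgoal, move on
    to the next one with probability skip_prob, repeatedly, never passing the
    final goal. Returns the index of the subgoal the agent should pursue."""
    if not path:
        raise ValueError("planned path must not be empty")
    idx = 0
    while idx < len(path) - 1 and rng.random() < skip_prob:
        idx += 1
    return idx


if __name__ == "__main__":
    # Toy policy that simply steps toward whichever goal it is given.
    toy_policy = lambda s, g: [gi - si for si, gi in zip(s, g)]
    loss = pig_loss(toy_policy, [[0.0, 0.0]], [[2.0, 0.0]], [[1.0, 0.0]])
    print("self-imitation loss:", loss)

    planned_path = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
    chosen = skip_subgoals(planned_path, skip_prob=0.5, rng=random.Random(0))
    print("pursuing subgoal index:", chosen)
```

In a real training loop the imitation term would presumably be added to the base GCRL objective with a weighting coefficient, and the planned path would come from a shortest-path search on the learned graph. Both details are outside what this summary specifies.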