Abstract: Recently, graph-based planning algorithms have gained much attention to solve goal-conditioned reinforcement learning (RL) tasks: they provide a sequence of subgoals to reach the target-goal, and the agents learn to execute subgoal-conditioned policies. However, the sample-efficiency of such RL schemes still remains a challenge, particularly for long-horizon tasks. To address this issue, we present a simple yet effective self-imitation scheme which distills a subgoal-conditioned policy into the target-goal-conditioned policy. Our intuition here is that to reach a target-goal, an agent should pass through a subgoal, so target-goal- and subgoal-conditioned policies should be similar to each other. We also propose a novel scheme of stochastically skipping executed subgoals in a planned path, which further improves performance. Unlike prior methods that only utilize graph-based planning in an execution phase, our method transfers knowledge from a planner along with a graph into policy learning. We empirically show that our method can significantly boost the sample-efficiency of the existing goal-conditioned RL methods under various long-horizon control tasks.

What problem does this paper attempt to address?

The main problem this paper addresses is the poor sample efficiency of Goal-Conditioned Reinforcement Learning (GCRL) methods on long-horizon tasks. The authors point out that although graph-based planning algorithms perform well on goal-conditioned reinforcement learning tasks, existing methods remain sample-inefficient, especially when the horizon is long. To tackle this challenge, the authors propose a self-imitation learning framework, Planning-guided self-Imitation learning for Goal-conditioned policies (PIG), which aims to improve existing GCRL methods through two main contributions:

1. **Self-Imitation Learning during Training**
   - The authors propose a new training objective that encourages the goal-conditioned policy to imitate the sub-goal-conditioned policy. The intuition is that to reach the final goal, the agent must pass through a sub-goal, so the goal-conditioned policy and the sub-goal-conditioned policy should be similar.
   - Specifically, the authors design a loss term \( L_{PIG} \) that distills the planned sub-goal-conditioned policy into the goal-conditioned policy.
2. **Sub-goal Skipping during Execution**
   - The authors propose randomly skipping sub-goals in the planned path to further improve sample efficiency.
   - During the execution phase, the agent can randomly "skip" some sub-goals, especially when the learned policy is strong enough. This skipping can help the agent find a better path to the goal.

With these improvements, the authors aim to significantly enhance the sample efficiency of existing GCRL methods in various long-horizon control tasks. Experimental results show that PIG significantly improves the performance of existing methods such as MSS in multiple complex environments, particularly in tasks such as the large U-shaped AntMaze.
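The two contributions can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a deterministic policy, a squared-error distance for the \( L_{PIG} \) term, and a single fixed skip probability; none of these are specified in the summary. The names `policy`, `pig_loss`, `skip_subgoals`, and `skip_prob` are hypothetical, and the linear policy stands in for whatever network an actual agent would use.

```python
import numpy as np


def policy(weights, state, goal):
    """Hypothetical deterministic linear policy: action = W @ [state; goal]."""
    return weights @ np.concatenate([state, goal])


def pig_loss(weights, states, goals, subgoals):
    """Self-imitation term (assumed squared-error form).

    For each state, the action conditioned on the planned subgoal is the
    target, and the action conditioned on the final goal is pulled toward it.
    In an autodiff framework the target would be held fixed (no gradient).
    """
    total = 0.0
    for s, g, sg in zip(states, goals, subgoals):
        target = policy(weights, s, sg)   # subgoal-conditioned action
        student = policy(weights, s, g)   # goal-conditioned action
        total += float(np.sum((student - target) ** 2))
    return total / len(states)


def skip_subgoals(path, skip_prob, rng):
    """Choose which subgoal on a planned path to pursue next.

    Starting from the first subgoal, advance to the following one with
    probability skip_prob, never moving past the final goal (last element).
    """
    idx = 0
    while idx < len(path) - 1 and rng.random() < skip_prob:
        idx += 1
    return path[idx]


# Illustrative usage with made-up values.
rng = np.random.default_rng(0)
W = np.ones((2, 4))                      # 2-D action, 2-D state, 2-D goal
states = [np.zeros(2)]
goals = [np.array([1.0, 0.0])]
subgoals = [np.array([0.0, 0.0])]
loss = pig_loss(W, states, goals, subgoals)
planned_path = ["subgoal_1", "subgoal_2", "final_goal"]
next_target = skip_subgoals(planned_path, skip_prob=0.5, rng=rng)
```

In this sketch the loss is zero exactly when the policy outputs the same action for the final goal and for the planned sub-goal, which mirrors the stated intuition that the two policies should be similar. Setting `skip_prob` to 0 recovers plain execution of the planned path in order.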

Imitating Graph-Based Planning with Goal-Conditioned Policies
