PIPER: Primitive-Informed Preference-based Hierarchical Reinforcement Learning via Hindsight Relabeling

Utsav Singh, Wesley A. Suttle, Brian M. Sadler, Vinay P. Namboodiri, Amrit Singh Bedi
2024-06-16
Abstract: In this work, we introduce PIPER: Primitive-Informed Preference-based Hierarchical reinforcement learning via Hindsight Relabeling, a novel approach that leverages preference-based learning to learn a reward model, and subsequently uses this reward model to relabel higher-level replay buffers. Since this reward is unaffected by lower primitive behavior, our relabeling-based approach is able to mitigate non-stationarity, which is common in existing hierarchical approaches, and demonstrates impressive performance across a range of challenging sparse-reward tasks. Since obtaining human feedback is typically impractical, we propose to replace the human-in-the-loop approach with our primitive-in-the-loop approach, which generates feedback using sparse rewards provided by the environment. Moreover, in order to prevent infeasible subgoal prediction and avoid degenerate solutions, we propose primitive-informed regularization that conditions higher-level policies to generate feasible subgoals for lower-level policies. We perform extensive experiments to show that PIPER mitigates non-stationarity in hierarchical reinforcement learning and achieves greater than 50% success rates in challenging, sparse-reward robotic environments, where most other baselines fail to achieve any significant progress.
Machine Learning
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper aims to address two major issues in Hierarchical Reinforcement Learning (HRL) in sparse-reward environments:

1. **Non-stationarity**: The behavior of the lower-level policy keeps changing during training, so the transitions and rewards seen by the higher-level policy become non-stationary. This makes it difficult for off-policy HRL methods to train stably.
2. **Infeasible subgoal generation**: The subgoals generated by the higher-level policy may not be achievable by the lower-level policy, which stalls learning.

To solve these problems, the paper proposes PIPER (Primitive-Informed Preference-based Hierarchical Reinforcement Learning via Hindsight Relabeling). PIPER addresses the above challenges through the following innovations:

- Using preference-based learning to learn a higher-level reward model that is unaffected by changes in lower-level primitive behavior, and using this model to relabel the higher-level replay buffer.
- Proposing an alternative to human feedback called Primitive-in-the-Loop (PiL), which generates preferences between trajectories from the sparse rewards provided by the environment.
- Introducing hindsight relabeling to increase sample efficiency and reduce the impact of sparse rewards.
- Employing primitive-informed regularization so that the subgoals generated by the higher-level policy are feasible for the lower-level policy.

Experimental results show that PIPER achieves success rates above 50% in challenging sparse-reward robotic environments, where most other baselines fail to make significant progress.
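
The preference-labeling and relabeling steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a Bradley-Terry preference model over summed predicted rewards (a common choice in preference-based RL that the summary does not confirm), and every name here (`pil_preference`, `bradley_terry_prob`, `preference_loss`, `relabel`, and the `(state, subgoal, reward, next_state)` tuple layout) is hypothetical.

```python
import math


def pil_preference(sparse_rewards_a, sparse_rewards_b):
    """Primitive-in-the-loop style label (assumed form): prefer the trajectory
    with the higher sum of sparse environment rewards.
    Returns 1.0 if a is preferred, 0.0 if b is preferred, 0.5 on a tie."""
    ret_a, ret_b = sum(sparse_rewards_a), sum(sparse_rewards_b)
    if ret_a > ret_b:
        return 1.0
    if ret_a < ret_b:
        return 0.0
    return 0.5


def bradley_terry_prob(pred_rewards_a, pred_rewards_b):
    """P(a preferred over b) under an assumed Bradley-Terry model, i.e. a
    sigmoid of the difference in summed predicted rewards."""
    diff = sum(pred_rewards_a) - sum(pred_rewards_b)
    # Numerically stable sigmoid: never exponentiate a large positive number.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def preference_loss(label, prob, eps=1e-8):
    """Cross-entropy between the preference label and the model probability."""
    return -(label * math.log(prob + eps)
             + (1.0 - label) * math.log(1.0 - prob + eps))


def relabel(transitions, reward_model):
    """Replace stored higher-level rewards with reward-model predictions.
    Each transition is assumed to be (state, subgoal, reward, next_state)."""
    return [(s, g, reward_model(s, g), s2) for (s, g, _, s2) in transitions]


# Usage: trajectory a reaches the goal on its last step, trajectory b never does.
label = pil_preference([-1, -1, 0], [-1, -1, -1])   # a is preferred
prob = bradley_terry_prob([0.2, 0.4], [0.1, 0.1])   # model's current belief
loss = preference_loss(label, prob)                 # minimized when training the reward model
```

In a full system the reward model would be a trainable function approximator updated by minimizing `preference_loss` over many trajectory pairs. The relabeled transitions would then be used to train the higher-level policy.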