PIPER: Primitive-Informed Preference-based Hierarchical Reinforcement Learning via Hindsight Relabeling

Utsav Singh, Wesley A. Suttle, Brian M. Sadler, Vinay P. Namboodiri, Amrit Singh Bedi

2024-06-16

Abstract: In this work, we introduce PIPER: Primitive-Informed Preference-based Hierarchical reinforcement learning via Hindsight Relabeling, a novel approach that leverages preference-based learning to learn a reward model, and subsequently uses this reward model to relabel higher-level replay buffers. Since this reward is unaffected by lower primitive behavior, our relabeling-based approach is able to mitigate non-stationarity, which is common in existing hierarchical approaches, and demonstrates impressive performance across a range of challenging sparse-reward tasks. Since obtaining human feedback is typically impractical, we propose to replace the human-in-the-loop approach with our primitive-in-the-loop approach, which generates feedback using sparse rewards provided by the environment. Moreover, in order to prevent infeasible subgoal prediction and avoid degenerate solutions, we propose primitive-informed regularization that conditions higher-level policies to generate feasible subgoals for lower-level policies. We perform extensive experiments to show that PIPER mitigates non-stationarity in hierarchical reinforcement learning and achieves greater than 50% success rates in challenging, sparse-reward robotic environments, where most other baselines fail to achieve any significant progress.

Machine Learning

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper addresses two major issues in Hierarchical Reinforcement Learning (HRL) in sparse-reward environments:

1. **Non-stationarity**: The behavior of the lower-level policy keeps changing during training, so the experience collected for the higher-level policy becomes outdated. This makes it difficult for standard off-policy HRL methods to train stably.
2. **Infeasible subgoal generation**: The subgoals generated by the higher-level policy may not be achievable by the lower-level policy, which stalls learning.

To solve these problems, the paper proposes PIPER (Primitive-Informed Preference-based Hierarchical Reinforcement Learning via Hindsight Relabeling). PIPER combines the following components:

- **Preference-based learning**: learns a higher-level reward model that is unaffected by changes in lower-level primitive behavior. This reward model is then used to relabel the higher-level replay buffer.
- **Primitive-in-the-Loop (PiL)**: an alternative to human feedback that generates preferences between trajectories from the sparse rewards provided by the environment.
- **Hindsight relabeling**: increases sample efficiency and reduces the impact of sparse rewards.
- **Primitive-informed regularization**: conditions the higher-level policy to generate subgoals that are feasible for the lower-level policy.

Experimental results show that PIPER achieves success rates above 50% in challenging sparse-reward robotic environments, where most other baselines fail to make significant progress.
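To make the preference-learning and relabeling ideas above more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a Primitive-in-the-Loop label can be obtained by comparing the summed sparse environment rewards of two trajectories, and that the reward model is trained with a standard Bradley-Terry cross-entropy loss, as is common in preference-based RL. All function names, the transition layout `(state, subgoal, reward, next_state)`, and the reward-model signature are hypothetical.

```python
import math


def pil_preference(env_rewards_a, env_rewards_b):
    """Assumed PiL labeling rule: prefer the trajectory with the larger
    sum of sparse environment rewards. Returns 1.0 if A is preferred,
    0.0 if B is preferred, and 0.5 for a tie."""
    ret_a, ret_b = sum(env_rewards_a), sum(env_rewards_b)
    if ret_a > ret_b:
        return 1.0
    if ret_a < ret_b:
        return 0.0
    return 0.5


def bradley_terry_prob(model_rewards_a, model_rewards_b):
    """P(A preferred over B) under the Bradley-Terry model, computed from
    per-step outputs of a (hypothetical) learned reward model."""
    diff = sum(model_rewards_a) - sum(model_rewards_b)
    # Numerically stable logistic function.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def preference_loss(label, model_rewards_a, model_rewards_b):
    """Cross-entropy between the preference label and the model's
    Bradley-Terry probability."""
    p = bradley_terry_prob(model_rewards_a, model_rewards_b)
    eps = 1e-12  # avoids log(0)
    return -(label * math.log(p + eps) + (1.0 - label) * math.log(1.0 - p + eps))


def relabel_high_level(buffer, reward_model):
    """Replace the stored reward of each higher-level transition with the
    reward model's output. Transitions are assumed to be tuples of
    (state, subgoal, reward, next_state)."""
    return [
        (state, subgoal, reward_model(state, subgoal), next_state)
        for (state, subgoal, _old_reward, next_state) in buffer
    ]
```

In this sketch the relabeled reward depends only on the reward model's inputs, not on how the lower-level policy behaved when the transition was collected. That is the intuition behind using relabeling to mitigate non-stationarity. Primitive-informed regularization is not shown here.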
