Abstract: This work introduces Hierarchical Preference Optimization (HPO), a novel approach to hierarchical reinforcement learning (HRL) that addresses non-stationarity and infeasible subgoal generation issues when solving complex robotic control tasks. HPO leverages maximum entropy reinforcement learning combined with token-level Direct Preference Optimization (DPO), eliminating the need for pre-trained reference policies that are typically unavailable in challenging robotic scenarios. Mathematically, we formulate HRL as a bi-level optimization problem and transform it into a primitive-regularized DPO formulation, ensuring feasible subgoal generation and avoiding degenerate solutions. Extensive experiments on challenging robotic navigation and manipulation tasks demonstrate impressive performance of HPO, where it shows an improvement of up to 35% over the baselines. Furthermore, ablation studies validate our design choices, and quantitative analyses confirm the ability of HPO to mitigate non-stationarity and infeasible subgoal generation issues in HRL.

What problem does this paper attempt to address?

This paper addresses two main problems in Hierarchical Reinforcement Learning (HRL):

1. **Non-stationarity**:
   - When the lower-level policy keeps changing, the reward function seen by the higher-level policy becomes unstable. Experience previously collected by the higher-level policy becomes obsolete, which makes it less useful for training.
   - Specifically, because the behavior of the lower-level policy changes over time, the rewards received by the higher-level policy are non-stationary, which destabilizes training.
2. **Infeasible subgoal generation**:
   - The higher-level policy may generate subgoals that the lower-level policy struggles to achieve, which hinders learning and reduces overall performance.
   - A sub-optimal lower-level policy is less able to reach a given subgoal. This in turn distorts credit assignment for subsequent subgoals, which can lead the higher-level policy to generate infeasible subgoals.

To solve these problems, the paper introduces a new method, **Hierarchical Preference Optimization (HPO)**, which works as follows:

- **Primitive-regularized Direct Preference Optimization (DPO)**:
  - The paper derives a token-level DPO objective within the maximum-entropy reinforcement learning framework and regularizes it with the lower-level value function, so that the generated subgoals are feasible.
  - This avoids the need for a pre-trained reference policy and optimizes the higher-level policy directly from preference data, which reduces its dependence on the changing lower-level policy and alleviates non-stationarity.
- **Bi-level optimization formulation**:
  - HRL is modeled as a bi-level optimization problem so that the nested structure of the problem is preserved during optimization, which allows non-stationarity and infeasible subgoal generation to be handled together.
- **Automatic preference feedback from sparse environment rewards**:
  - To avoid the cost of collecting large amounts of human preference data, the paper proposes a primitive-in-the-loop method that uses the sparse rewards provided by the environment to simulate preference feedback.

In summary, HPO addresses non-stationarity and infeasible subgoal generation in HRL by combining primitive-regularized DPO with a bi-level optimization framework, and the paper reports significant performance improvements on complex robot control tasks.
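The two mechanisms described above, preference labels derived from sparse environment rewards and a preference loss regularized by the lower-level value function, can be illustrated with a minimal sketch. This is not the paper's exact objective. The function names, the additive scoring form `alpha * sum(log pi) + lam * sum(V)`, and the parameters `alpha` and `lam` are all assumptions made for illustration. The sketch only shows the general shape of a reference-free, Bradley-Terry style preference loss in which subgoals with higher lower-level value score better.

```python
import numpy as np


def preference_from_sparse_rewards(rewards_a, rewards_b):
    """Simulate a preference label from sparse environment rewards.

    Hypothetical helper: returns 1.0 if segment A has the higher summed
    reward, 0.0 if segment B does, and 0.5 on a tie.
    """
    ret_a = float(np.sum(rewards_a))
    ret_b = float(np.sum(rewards_b))
    if ret_a > ret_b:
        return 1.0
    if ret_a < ret_b:
        return 0.0
    return 0.5


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -np.logaddexp(0.0, -x)


def primitive_regularized_preference_loss(
    logp_a, logp_b, value_a, value_b, label, alpha=1.0, lam=0.1
):
    """Illustrative preference loss with a lower-level value regularizer.

    logp_*  : per-step log-probabilities of the higher-level policy's subgoals
    value_* : lower-level value estimates for those subgoals (assumed given)
    label   : 1.0 if segment A is preferred, 0.0 if B, 0.5 if tied
    alpha, lam : assumed weights, not taken from the paper
    """
    # Score each segment: policy log-likelihood plus a feasibility bonus.
    score_a = alpha * np.sum(logp_a) + lam * np.sum(value_a)
    score_b = alpha * np.sum(logp_b) + lam * np.sum(value_b)
    diff = score_a - score_b
    # Bradley-Terry cross-entropy; no reference policy appears anywhere.
    loss = -(label * log_sigmoid(diff) + (1.0 - label) * log_sigmoid(-diff))
    return float(loss)


if __name__ == "__main__":
    # Segment A reaches the goal (sparse reward 1), segment B does not.
    label = preference_from_sparse_rewards([0.0, 0.0, 1.0], [0.0, 0.0, 0.0])
    loss = primitive_regularized_preference_loss(
        logp_a=[-0.5, -0.4], logp_b=[-1.2, -1.5],
        value_a=[0.8, 0.9], value_b=[0.1, 0.2],
        label=label,
    )
    print(label, loss)
```

In this sketch the loss falls when the preferred segment has both higher subgoal likelihood under the higher-level policy and higher lower-level value, which is one simple way to express the idea that feasible subgoals should be favored.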

Hierarchical Preference Optimization: Learning to achieve goals via feasible subgoals prediction

Beyond Reward: Offline Preference-guided Policy Optimization

DIPPER: Direct Preference Optimization to Accelerate Primitive-Enabled Hierarchical Reinforcement Learning

Bidirectional-Reachable Hierarchical Reinforcement Learning with Mutually Responsive Policies

Efficient Hierarchical Exploration with an Active Subgoal Generation Strategy

Policy Optimization in RLHF: The Impact of Out-of-preference Data

Preference as Reward, Maximum Preference Optimization with Importance Sampling

Hierarchical reinforcement learning with natural language subgoals

Self-Improving Robust Preference Optimization

Human-in-the-Loop Policy Optimization for Preference-Based Multi-Objective Reinforcement Learning

Hierarchical Reinforcement Learning with Advantage-Based Auxiliary Rewards

Hierarchical Potential-based Reward Shaping from Task Specifications

Hierarchical Reinforcement Learning in Complex 3D Environments

Data-Efficient Hierarchical Reinforcement Learning for Robotic Assembly Control Applications

PIPER: Primitive-Informed Preference-based Hierarchical Reinforcement Learning via Hindsight Relabeling

Probabilistic Subgoal Representations for Hierarchical Reinforcement learning

Hierarchical Policy Learning is Sensitive to Goal Space Design

Hybrid Preference Optimization: Augmenting Direct Preference Optimization with Auxiliary Objectives

Sub-policy Adaptation for Hierarchical Reinforcement Learning

Hierarchical Reinforcement Learning via Advantage-Weighted Information Maximization

CRISP: Curriculum Inducing Primitive Informed Subgoal Prediction for Hierarchical Reinforcement Learning