Utsav Singh, Souradip Chakraborty, Wesley A. Suttle, Brian M. Sadler, Anit Kumar Sahu, Mubarak Shah, Vinay P. Namboodiri, Amrit Singh Bedi
Abstract: This work introduces Hierarchical Preference Optimization (HPO), a novel approach to hierarchical reinforcement learning (HRL) that addresses non-stationarity and infeasible subgoal generation issues when solving complex robotic control tasks. HPO leverages maximum entropy reinforcement learning combined with token-level Direct Preference Optimization (DPO), eliminating the need for pre-trained reference policies that are typically unavailable in challenging robotic scenarios. Mathematically, we formulate HRL as a bi-level optimization problem and transform it into a primitive-regularized DPO formulation, ensuring feasible subgoal generation and avoiding degenerate solutions. Extensive experiments on challenging robotic navigation and manipulation tasks demonstrate impressive performance of HPO, where it shows an improvement of up to 35% over the baselines. Furthermore, ablation studies validate our design choices, and quantitative analyses confirm the ability of HPO to mitigate non-stationarity and infeasible subgoal generation issues in HRL.
What problem does this paper attempt to address?
This paper addresses two main problems:
1. **Non-stationarity**:
   - In Hierarchical Reinforcement Learning (HRL), the lower-level policy keeps changing during training, so the same subgoal can produce different outcomes, and therefore different rewards, for the higher-level policy. Experience the higher-level policy collected earlier becomes obsolete and less useful for learning.
   - Because the rewards seen by the higher-level policy drift as the lower-level behavior changes, higher-level training becomes unstable.
2. **Infeasible subgoal generation**:
   - The higher-level policy may generate subgoals that the lower-level policy cannot reach, which hinders learning and reduces overall performance.
   - A sub-optimal lower-level policy may fail to reach a given subgoal, which distorts credit assignment for subsequent subgoals and can push the higher-level policy toward generating infeasible subgoals.
To address these problems, the paper introduces **Hierarchical Preference Optimization (HPO)**, which has three main components:
- **Primitive-regularized Direct Preference Optimization (DPO)**:
  - The paper derives a token-level DPO objective within the maximum-entropy reinforcement learning framework and regularizes it with the lower-level value function, so that the generated subgoals remain feasible for the lower-level policy.
  - This removes the need for a pre-trained reference policy and optimizes the higher-level policy directly from preference data, which reduces its dependence on the changing lower-level policy and alleviates non-stationarity.
- **Bi-level optimization formulation**:
  - The HRL problem is formulated as a bi-level optimization problem, which preserves the nested structure between the two levels during optimization and addresses both non-stationarity and infeasible subgoal generation.
- **Automatic preference feedback from sparse environment rewards**:
  - To avoid the cost of collecting large amounts of human preference data, the paper proposes a primitive-in-the-loop method that uses the sparse rewards provided by the environment to simulate preference feedback.
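To make the last two ideas more concrete, the following is a minimal sketch, not the paper's implementation. It assumes that a preference between two higher-level segments can be simulated by comparing their summed sparse rewards, and that the lower-level value estimate enters as an additive term in each segment's score. All function names, the weights `beta` and `lam`, and the exact form of the score are hypothetical choices made for illustration; the paper's actual objective may differ.

```python
import math


def preference_label(rewards_a, rewards_b):
    """Simulate a preference from sparse environment rewards (assumed rule).

    Returns 1.0 if segment A is preferred, 0.0 if segment B is preferred,
    and 0.5 on a tie.
    """
    return_a, return_b = sum(rewards_a), sum(rewards_b)
    if return_a > return_b:
        return 1.0
    if return_a < return_b:
        return 0.0
    return 0.5


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def segment_score(subgoal_logps, lower_values, beta=1.0, lam=0.1):
    """Hypothetical segment score.

    Sums beta * log-probability of each subgoal under the higher-level
    policy plus lam * the lower-level value estimate for that subgoal.
    No reference policy appears in the score.
    """
    return sum(beta * lp + lam * v for lp, v in zip(subgoal_logps, lower_values))


def preference_loss(score_a, score_b, label):
    """Bradley-Terry negative log-likelihood of the preference label."""
    margin = score_a - score_b
    return -(label * log_sigmoid(margin) + (1.0 - label) * log_sigmoid(-margin))


# Usage: segment A receives a sparse reward at its last step, segment B never does.
label = preference_label([0, 0, 1], [0, 0, 0])
score_a = segment_score([-1.0, -2.0], [0.5, 0.5])
score_b = segment_score([-1.5, -2.5], [0.1, 0.0])
loss = preference_loss(score_a, score_b, label)
```

In this sketch, a subgoal with a low lower-level value estimate lowers its segment's score, which is one simple way a feasibility signal could shape the higher-level update. In practice the log-probabilities and value estimates would come from learned networks, and the loss would be minimized with a gradient-based optimizer.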
In summary, HPO mitigates non-stationarity and infeasible subgoal generation in HRL by combining primitive-regularized DPO with a bi-level optimization framework, and the paper reports an improvement of up to 35% over the baselines on complex robotic navigation and manipulation tasks.