Bilevel reinforcement learning via the development of hyper-gradient without lower-level convexity

Yan Yang, Bin Gao, Ya-xiang Yuan
2024-05-30
Abstract: Bilevel reinforcement learning (RL), which features intertwined two-level problems, has attracted growing interest recently. The inherent non-convexity of the lower-level RL problem is, however, an impediment to developing bilevel optimization methods. By employing the fixed point equation associated with the regularized RL, we characterize the hyper-gradient via fully first-order information, thus circumventing the assumption of lower-level convexity. This, remarkably, distinguishes our development of hyper-gradient from the general AID-based bilevel frameworks since we take advantage of the specific structure of RL problems. Moreover, we propose both model-based and model-free bilevel reinforcement learning algorithms, facilitated by access to the fully first-order hyper-gradient. Both algorithms provably enjoy the convergence rate $\mathcal{O}(\epsilon^{-1})$. To the best of our knowledge, this is the first time that AID-based bilevel RL gets rid of additional assumptions on the lower-level problem. In addition, numerical experiments demonstrate that the hyper-gradient indeed serves as an integration of exploitation and exploration.
Optimization and Control, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The problem this paper addresses is how to develop hyper-gradients in bilevel reinforcement learning (BRL) without assuming convexity of the lower-level problem, thereby overcoming the optimization obstacle posed by the inherent non-convexity of the lower-level RL problem. Specifically, the paper considers bilevel reinforcement learning problems of the following form:

\[ \min_{x \in \mathbb{R}^n} \phi(x) := f(x, \pi^*(x)) \quad \text{s.t.} \quad \pi^*(x) \in \arg\min_{\pi \in \Pi} g(x, \pi) \]

where:
- $\Pi$ is the policy set.
- $f$ is the upper-level objective function defined on $\mathbb{R}^n \times \Pi$.
- $g$ is the lower-level objective function.
- $\phi(x)$ is called the hyper-objective function.
- $\pi^*(x)$ is the optimal solution to the lower-level problem.

### Main Contributions
1. **Characterization of Hyper-gradients**: By studying the fixed-point equation associated with the entropy-regularized reinforcement learning problem, the paper characterizes the hyper-gradient using fully first-order information and reveals its properties. This extends the spirit of previous works [15, 24-26].
2. **Algorithm Design**: Based on this characterization, the paper constructs hyper-gradient estimators and proposes a model-based bilevel reinforcement learning algorithm (M-SoBiRL) and its model-free version (SoBiRL). Both algorithms require only first-order oracles and avoid the complex second-order queries of general AID-based methods. To the best of the authors' knowledge, this is the first AID-based bilevel reinforcement learning algorithm that needs no additional assumptions on the lower-level problem.
3. **Convergence Analysis**: The paper analyzes the efficiency gained by amortizing the hyper-gradient approximation over the outer iterations in M-SoBiRL: when the number of inner iterations is $N = \mathcal{O}(1)$, independent of the solution accuracy $\epsilon$, the algorithm enjoys a convergence rate of $\mathcal{O}(\epsilon^{-1})$. In the model-free scenario, enhanced convergence properties are also established.

### Application Examples
The paper introduces two bilevel reinforcement learning applications:
1. **Reward Shaping**: An auxiliary reward function is shaped for lower-level training to improve the agent's training efficiency, while the upper-level environment is kept unchanged so that the result stays aligned with the original task evaluation.
2. **Reinforcement Learning from Human Feedback (RLHF)**: An intrinsic reward function that encodes expert knowledge is learned from simple labels. The lower level optimizes the policy under the parameterized reward, and the upper level adjusts $x$ so that the preferences predicted by the reward model align with the real labels.

### Experimental Results
Experiments verify the effectiveness of the model-free algorithm SoBiRL on the Atari game BeamRider. The results show that even when the reward is unknown, the performance of SoBiRL is comparable to that of the baseline algorithm SAC, and within the same number of external time steps SoBiRL achieves a higher per-episode return and a longer per-episode life. SoBiRL also performs well in reward prediction and policy optimization, although its alignment accuracy on trajectory preferences is slightly lower than that of DRLHF.

In conclusion, the paper addresses the inherent non-convexity of the lower-level problem in bilevel reinforcement learning by introducing a new hyper-gradient characterization and corresponding algorithm designs, providing new ideas and tools for further research in this field.
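
The fixed-point equation of entropy-regularized RL that the contributions rely on can be illustrated on a toy problem. The sketch below is not the authors' M-SoBiRL or SoBiRL; it only shows the standard soft Bellman fixed point for a lower-level problem with a fixed reward. Everything in it is an assumption made for illustration: the two-state MDP, the names `soft_bellman_backup` and `soft_value_iteration`, the temperature `tau`, and the reward-maximization sign convention (the formulation above writes the lower level as a minimization of $g$).

```python
import math


def soft_bellman_backup(V, reward, transition, gamma, tau):
    """Apply one soft Bellman backup and return (Q, V_new).

    Q(s, a) = r(s, a) + gamma * sum_t P(t | s, a) * V(t)
    V_new(s) = tau * log sum_a exp(Q(s, a) / tau)
    """
    Q = [
        [
            reward[s][a] + gamma * sum(p * V[t] for t, p in enumerate(transition[s][a]))
            for a in range(len(reward[s]))
        ]
        for s in range(len(reward))
    ]
    V_new = []
    for q_row in Q:
        m = max(q_row)  # subtract the max for a numerically stable log-sum-exp
        V_new.append(m + tau * math.log(sum(math.exp((q - m) / tau) for q in q_row)))
    return Q, V_new


def soft_value_iteration(reward, transition, gamma=0.9, tau=1.0, tol=1e-10, max_iters=10_000):
    """Iterate the backup until V stops changing; return (V, policy)."""
    V = [0.0] * len(reward)
    for _ in range(max_iters):
        _, V_new = soft_bellman_backup(V, reward, transition, gamma, tau)
        gap = max(abs(a - b) for a, b in zip(V, V_new))
        V = V_new
        if gap < tol:
            break
    # The regularized optimal policy is the softmax of Q / tau at the fixed point.
    Q, _ = soft_bellman_backup(V, reward, transition, gamma, tau)
    policy = []
    for q_row in Q:
        m = max(q_row)
        weights = [math.exp((q - m) / tau) for q in q_row]
        total = sum(weights)
        policy.append([w / total for w in weights])
    return V, policy


# Hypothetical toy MDP: 2 states, 2 actions, uniform transitions for every (s, a).
reward = [[1.0, 0.0], [0.0, 0.5]]
transition = [[[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]]]
V, policy = soft_value_iteration(reward, transition)
```

Because the transitions in this toy example do not depend on the action, the policy in each state reduces to a softmax over that state's rewards, which makes the result easy to check by hand. In a bilevel setting the reward would depend on the upper-level variable $x$, so the fixed point and the induced policy would change with $x$ as well.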