Abstract: The effectiveness of reinforcement learning (RL) agents in continuous control robotics tasks depends heavily on the design of the underlying reward function. However, a misalignment between the reward function and user intentions, values, or social norms can be catastrophic in the real world. Current methods mitigate this misalignment by learning reward functions from human preferences; however, they inadvertently introduce a risk of reward overoptimization. In this work, we address this challenge by advocating for the adoption of regularized reward functions that more accurately mirror the intended behaviors. We propose a novel concept of reward regularization within the robotic RLHF (RL from Human Feedback) framework, which we refer to as \emph{agent preferences}. Our approach incorporates not only human feedback in the form of preferences but also the preferences of the RL agent itself during the reward function learning process. This dual consideration significantly mitigates the issue of reward function overoptimization in RL. We provide a theoretical justification for the proposed approach by formulating the robotic RLHF problem as a bilevel optimization problem. We demonstrate the efficiency of our algorithm {\ours} on several continuous control benchmarks, including DeepMind Control Suite \cite{tassa2018deepmind} and MetaWorld \cite{yu2021metaworld}, as well as high-dimensional visual environments, with an improvement of more than 70\% in sample efficiency over current SOTA baselines. This showcases our approach's effectiveness in aligning reward functions with true behavioral intentions, setting a new benchmark in the field.
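
As a rough illustration of the bilevel view mentioned in the abstract, and not the paper's exact objective, a generic preference-based reward-learning problem could be sketched as below. All symbols are assumptions introduced here for illustration: $\nu$ denotes reward parameters, $\theta$ policy parameters, $\mathcal{D}$ a dataset of trajectory pairs in which $\tau^{w}$ is preferred over $\tau^{l}$, $R_{\nu}(\tau) = \sum_{t} r_{\nu}(s_t, a_t)$ the return of a trajectory under the learned reward, $\sigma$ the logistic sigmoid, $\lambda \ge 0$ a weight, and $\Omega$ a placeholder for a regularizer (the abstract's agent-preference term would play this role, but its form is not given here).

\begin{align}
\min_{\nu} \quad & -\,\mathbb{E}_{(\tau^{w},\tau^{l}) \sim \mathcal{D}} \left[ \log \sigma\!\left( R_{\nu}(\tau^{w}) - R_{\nu}(\tau^{l}) \right) \right] + \lambda\, \Omega\!\left( \nu, \theta^{*}(\nu) \right) \\
\text{s.t.} \quad & \theta^{*}(\nu) \in \arg\max_{\theta} \; \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ R_{\nu}(\tau) \right]
\end{align}

In this sketch the upper level fits the reward to human preferences through a Bradley-Terry likelihood, while the lower level trains the policy against that reward. Setting $\lambda = 0$ recovers unregularized preference-based reward learning, which is the setting in which overoptimization is described as a risk.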

Control Regularization for Reduced Variance Reinforcement Learning

Is High Variance Unavoidable in RL? A Case Study in Continuous Control

Sublinear Regret for a Class of Continuous-Time Linear--Quadratic Reinforcement Learning Problems

Optimal Scheduling of Entropy Regularizer for Continuous-Time Linear-Quadratic Reinforcement Learning

Variance aware reward smoothing for deep reinforcement learning

Decoupling regularization from the action space

REBEL: A Regularization-Based Solution for Reward Overoptimization in Robotic Reinforcement Learning from Human Feedback

Reinforcement Learning with Adaptive Regularization for Safe Control of Critical Systems

Optimal scheduling of entropy regulariser for continuous-time linear-quadratic reinforcement learning

Regularized Policy Gradients: Direct Variance Reduction in Policy Gradient Estimation

Efficient Deep Reinforcement Learning Requires Regulating Overfitting

Robust Reinforcement Learning in Continuous Control Tasks with Uncertainty Set Regularization

Reinforcement Learning with General Utilities: Simpler Variance Reduction and Large State-Action Space

Improving the Robustness of Reinforcement Learning Policies with $\mathcal{L}_{1}$ Adaptive Control

Value constrained model-free continuous control

Reinforcement Learning for a Discrete-Time Linear-Quadratic Control Problem with an Application

Deep RL With Information Constrained Policies: Generalization in Continuous Control

Benchmarking Smoothness and Reducing High-Frequency Oscillations in Continuous Control Policies

Regularity and stability of feedback relaxed controls

Offline Policy Optimization in RL with Variance Regularization

Continuous‐time mean–variance portfolio selection: A reinforcement learning framework