Abstract: Designing reward functions that efficiently guide reinforcement learning (RL) agents toward specific behaviors is a complex task. It requires identifying reward structures that are not sparse and that do not inadvertently induce undesirable behaviors. Naively modifying the reward structure to offer denser and more frequent feedback can lead to unintended outcomes and promote behaviors that are not aligned with the designer's intended goal. Although potential-based reward shaping is often suggested as a remedy, we systematically investigate settings where deploying it frequently and significantly impairs performance. To address these issues, we introduce a new framework that uses a bi-level objective to learn \emph{behavior alignment reward functions}. These functions integrate auxiliary rewards, which reflect a designer's heuristics and domain knowledge, with the environment's primary rewards. Our approach automatically determines the most effective way to blend these two types of feedback, thereby enhancing robustness against heuristic reward misspecification. Remarkably, it can also adapt an agent's policy optimization process to mitigate suboptimalities resulting from limitations and biases inherent in the underlying RL algorithms. We evaluate our method's efficacy on a diverse set of tasks, from small-scale experiments to high-dimensional control challenges. We investigate heuristic auxiliary rewards of varying quality, some of which are beneficial and others detrimental to the learning process. Our results show that our framework offers a robust and principled way to integrate designer-specified heuristics. It not only addresses key shortcomings of existing approaches but also consistently leads to high-performing solutions, even when given misaligned or poorly specified auxiliary reward functions.

Behavior Alignment via Reward Function Optimization
