Abstract: In practice, reinforcement learning (RL) agents are often trained with a possibly imperfect proxy reward function, which may lead to a human-agent alignment issue (i.e., the learned policy either converges to non-optimal performance with low cumulative rewards, or achieves high cumulative rewards but in an undesired manner). To tackle this issue, we consider a framework where a human labeler can provide additional feedback in the form of corrective actions, which express the labeler's action preferences, although this feedback may be imperfect as well. In this setting, to obtain a better-aligned policy guided by both learning signals, we propose a novel value-based deep RL algorithm called Iterative learning from Corrective actions and Proxy rewards (ICoPro), which cycles through three phases: (1) solicit sparse corrective actions from a human labeler on the agent's demonstrated trajectories; (2) incorporate these corrective actions into the Q-function using a margin loss to enforce adherence to the labeler's preferences; (3) train the agent with standard RL losses regularized with a margin loss to learn from proxy rewards and propagate the Q-values learned from human feedback. Another novel design in our approach is to integrate pseudo-labels from the target Q-network to reduce human labor and further stabilize training. We experimentally validate our approach on a variety of tasks (Atari games and autonomous highway driving). On the one hand, using proxy rewards with different levels of imperfection, our method aligns better with human preferences and is more sample-efficient than baseline methods. On the other hand, when facing corrective actions with different types of imperfection, our method can overcome the non-optimality of this feedback thanks to the guidance from proxy rewards.
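
To make phase (2) more concrete, the following is a minimal sketch of a margin loss on Q-values. The abstract does not give the exact form of the loss, so this assumes the common large-margin formulation, in which every action other than the labeler's corrective action is penalized by a fixed margin. The function name `margin_loss`, the `margin` parameter, and its default value are hypothetical and chosen for illustration only.

```python
import numpy as np


def margin_loss(q_values, corrective_actions, margin=0.5):
    """Large-margin loss over a batch of labeled states (illustrative sketch).

    q_values: array of shape (batch, n_actions) with Q(s, a) estimates.
    corrective_actions: array of shape (batch,) with the labeler's action index.
    margin: assumed constant added to every non-preferred action.
    """
    q_values = np.asarray(q_values, dtype=float)
    corrective_actions = np.asarray(corrective_actions, dtype=int)
    rows = np.arange(q_values.shape[0])

    # Margin applies to all actions except the labeler's corrective action.
    penalties = np.full_like(q_values, margin)
    penalties[rows, corrective_actions] = 0.0

    # Loss is zero only when the corrective action beats every other
    # action by at least the margin.
    augmented_max = (q_values + penalties).max(axis=1)
    preferred_q = q_values[rows, corrective_actions]
    return float(np.mean(augmented_max - preferred_q))
```

Under this assumed form, the loss vanishes once the Q-value of the corrective action exceeds all others by at least `margin`, which is one way to encode adherence to the labeler's preferences directly in the Q-function.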

Improving Deep Reinforcement Learning with Mirror Loss

State Representation Learning for Effective Deep Reinforcement Learning

Sparse Q-learning with Mirror Descent

Sample-efficient multi-agent reinforcement learning with masked reconstruction

A Deep Reinforcement Learning Agent for Geometry Online Tutoring

Mirror: A Multiple-perspective Self-Reflection Method for Knowledge-rich Reasoning

Learning mirror maps in policy mirror descent

State Representation Learning with Adjacent State Consistency Loss for Deep Reinforcement Learning

What deep reinforcement learning tells us about human motor learning and vice-versa

Symmetry Considerations for Learning Task Symmetric Robot Policies

Population-aware Online Mirror Descent for Mean-Field Games by Deep Reinforcement Learning

A Sampling-based Learning Framework for Big Databases

Deep Reinforcement Learning with Decorrelation

MoMA: Model-based Mirror Ascent for Offline Reinforcement Learning

From mimic to counteract: a two-stage reinforcement learning algorithm for Google research football

Soft Hindsight Experience Replay

Discovered Policy Optimisation

Symmetric Reinforcement Learning Loss for Robust Learning on Diverse Tasks and Model Scales

Reinforcement Learning From Imperfect Corrective Actions And Proxy Rewards

Reflect-RL: Two-Player Online RL Fine-Tuning for LMs

Efficient Diversity-based Experience Replay for Deep Reinforcement Learning