Abstract: Recent advances in learning decision-making policies can largely be attributed to training expressive policy models, primarily via imitation learning. While imitation learning discards non-expert data, reinforcement learning (RL) can still learn from suboptimal data. However, instantiating RL training of a new policy class often presents a different challenge: most deep RL machinery is co-developed with assumptions on the policy class and backbone, resulting in poor performance when the policy class changes. For instance, SAC utilizes a low-variance reparameterization policy gradient for Gaussian policies, but this is unstable for diffusion policies and intractable for autoregressive categorical policies. To address this issue, we develop an offline RL and online fine-tuning approach called policy-agnostic RL (PA-RL) that can effectively train multiple policy classes with varying architectures and sizes. We build on the basic idea that a universal supervised learning loss can replace the policy improvement step in RL, as long as it is applied to "optimized" actions. To obtain these optimized actions, we first sample multiple actions from a base policy, then run global optimization (i.e., re-ranking multiple action samples using the Q-function) and local optimization (i.e., running gradient steps on an action sample) to maximize the critic on these candidates. PA-RL enables fine-tuning diffusion and transformer policies with either autoregressive tokens or continuous action outputs, at different sizes, entirely via actor-critic RL. Moreover, PA-RL improves performance and sample efficiency by up to 2 times compared to existing offline RL and online fine-tuning methods. We show the first result that successfully fine-tunes OpenVLA, a 7B generalist robot policy, autonomously with Cal-QL, an online RL fine-tuning algorithm, improving from 40% to 70% in the real world in 40 minutes.
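The action-optimization step described in the abstract (sample several actions, re-rank them with the Q-function, then take gradient steps on the action) can be illustrated with a small sketch. This is not the authors' implementation: the quadratic critic `toy_q`, its analytic gradient `toy_q_grad`, the noisy stand-in `sample_base_policy`, and all hyperparameters (`num_samples`, `top_k`, `grad_steps`, `step_size`) are hypothetical choices made only so the example runs on its own. In an actual system the critic would be a learned network and the gradient would come from automatic differentiation.

```python
import random


def toy_q(state, action):
    # Hypothetical critic (assumption): highest when the action equals the state.
    return -sum((a - s) ** 2 for a, s in zip(action, state))


def toy_q_grad(state, action):
    # Analytic gradient of toy_q with respect to the action.
    return [-2.0 * (a - s) for a, s in zip(action, state)]


def sample_base_policy(state, rng, noise=1.0):
    # Stand-in base policy (assumption): the state plus Gaussian noise.
    return [s + rng.gauss(0.0, noise) for s in state]


def optimize_actions(state, rng, num_samples=16, top_k=4,
                     grad_steps=5, step_size=0.1):
    # Sample several candidate actions from the base policy.
    candidates = [sample_base_policy(state, rng) for _ in range(num_samples)]

    # Global optimization: re-rank candidates by critic value, keep the best.
    candidates.sort(key=lambda a: toy_q(state, a), reverse=True)
    kept = candidates[:top_k]

    # Local optimization: a few gradient-ascent steps on each kept action.
    refined = []
    for action in kept:
        for _ in range(grad_steps):
            grad = toy_q_grad(state, action)
            action = [a + step_size * g for a, g in zip(action, grad)]
        refined.append(action)

    # `refined` would serve as targets for a supervised policy loss.
    return candidates, refined


if __name__ == "__main__":
    rng = random.Random(0)
    state = [0.5, -0.25]
    ranked, targets = optimize_actions(state, rng)
    print("best sampled Q:", toy_q(state, ranked[0]))
    print("best refined Q:", toy_q(state, targets[0]))
```

In this sketch the refined actions play the role of the "optimized" actions: a policy of any class could then be trained to imitate them with its usual supervised loss (for example, a denoising loss for a diffusion policy or cross-entropy for an autoregressive token policy), without backpropagating the critic through the policy itself.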

Leveraging Granularity: Hierarchical Reinforcement Learning for Pedagogical Policy Induction

Get a Head Start: On-Demand Pedagogical Policy Selection in Intelligent Tutoring

Bidirectional-Reachable Hierarchical Reinforcement Learning with Mutually Responsive Policies

Relabeling and policy distillation of hierarchical reinforcement learning

TGRL: An Algorithm for Teacher Guided Reinforcement Learning

Learning Two-Step Hybrid Policy for Graph-Based Interpretable Reinforcement Learning

Temporal-adaptive Hierarchical Reinforcement Learning

Hierarchical Programmatic Reinforcement Learning via Learning to Compose Programs

Algorithms for Batch Hierarchical Reinforcement Learning

Data-Efficient Hierarchical Reinforcement Learning for Robotic Assembly Control Applications

Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation

HRL2E: Hierarchical Reinforcement Learning with Low-level Ensemble

MENTOR: Guiding Hierarchical Reinforcement Learning with Human Feedback and Dynamic Distance Constraint

Synthesizing Programmatic Reinforcement Learning Policies with Large Language Model Guided Search

Large Language Model as a Policy Teacher for Training Reinforcement Learning Agents

Sub-policy Adaptation for Hierarchical Reinforcement Learning

Policy Agnostic RL: Offline RL and Online RL Fine-Tuning of Any Class and Backbone

LGR2: Language Guided Reward Relabeling for Accelerating Hierarchical Reinforcement Learning

Prioritized League Reinforcement Learning for Large-Scale Heterogeneous Multiagent Systems

The Perfect Blend: Redefining RLHF with Mixture of Judges

Hierarchical Reinforcement Learning with Advantage-Based Auxiliary Rewards