Abstract: Recent advances in learning decision-making policies can largely be attributed to training expressive policy models, primarily via imitation learning. While imitation learning discards non-expert data, reinforcement learning (RL) can still learn from suboptimal data. However, instantiating RL training of a new policy class often presents a different challenge: most deep RL machinery is co-developed with assumptions on the policy class and backbone, resulting in poor performance when the policy class changes. For instance, SAC utilizes a low-variance reparameterization policy gradient for Gaussian policies, but this is unstable for diffusion policies and intractable for autoregressive categorical policies. To address this issue, we develop an offline RL and online fine-tuning approach called policy-agnostic RL (PA-RL) that can effectively train multiple policy classes with varying architectures and sizes. We build on the basic idea that a universal supervised learning loss can replace the policy improvement step in RL, as long as it is applied to "optimized" actions. To obtain these optimized actions, we first sample multiple actions from a base policy, then run global optimization (i.e., re-ranking multiple action samples using the Q-function) and local optimization (i.e., running gradient steps on an action sample) to maximize the critic on these candidates. PA-RL enables fine-tuning diffusion and transformer policies with either autoregressive tokens or continuous action outputs, at different sizes, entirely via actor-critic RL. Moreover, PA-RL improves performance and sample efficiency by up to 2 times compared to existing offline RL and online fine-tuning methods. We show the first result that successfully fine-tunes OpenVLA, a 7B generalist robot policy, autonomously with Cal-QL, an online RL fine-tuning algorithm, improving from 40% to 70% in the real world in 40 minutes.
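The action optimization described in the abstract (sample candidates, re-rank them with the Q-function, then take gradient steps on the critic) can be illustrated with a minimal sketch. This is not the authors' implementation: the quadratic critic `q_value`, its analytic gradient `q_grad`, the function `optimize_actions`, and the parameters `top_k`, `grad_steps`, and `step_size` are all hypothetical stand-ins chosen so the example runs with NumPy alone. Random Gaussian samples take the place of actions drawn from a base policy.

```python
import numpy as np

def q_value(state, actions):
    """Toy critic (assumption): Q(s, a) = -||a - s||^2, maximized at a = s."""
    return -np.sum((actions - state) ** 2, axis=-1)

def q_grad(state, actions):
    """Analytic gradient of the toy critic with respect to the action."""
    return -2.0 * (actions - state)

def optimize_actions(state, candidates, top_k=4, grad_steps=10, step_size=0.1):
    """Return top_k candidate actions after re-ranking and gradient ascent."""
    # Global optimization: re-rank the samples by Q and keep the best top_k.
    scores = q_value(state, candidates)
    best = candidates[np.argsort(scores)[-top_k:]]
    # Local optimization: a few gradient-ascent steps on the critic.
    for _ in range(grad_steps):
        best = best + step_size * q_grad(state, best)
    return best

rng = np.random.default_rng(0)
state = np.array([0.5, -0.2])
# Stand-in for actions sampled from a base policy (assumption).
candidates = rng.normal(size=(16, 2))
optimized = optimize_actions(state, candidates)
```

In this reading, the rows of `optimized` would serve as regression targets for a supervised loss on the policy, which is what makes the update independent of the policy class; how the targets are weighted or selected in PA-RL itself is not specified in the abstract.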

Reincarnating Reinforcement Learning: Reusing Prior Computation to Accelerate Progress

Replay across Experiments: A Natural Extension of Off-Policy RL

Selective Reincarnation: Offline-to-Online Multi-Agent Reinforcement Learning

ReLIC: A Recipe for 64k Steps of In-Context Reinforcement Learning for Embodied AI

Efficient Online Reinforcement Learning with Offline Data

Policy Agnostic RL: Offline RL and Online RL Fine-Tuning of Any Class and Backbone

Jointly Pre-training with Supervised, Autoencoder, and Value Losses for Deep Reinforcement Learning

Read and Reap the Rewards: Learning to Play Atari with the Help of Instruction Manuals

Reinforcement Learning with Foundation Priors: Let the Embodied Agent Efficiently Learn on Its Own

Decoding Reinforcement Learning for newcomers

Model-Free Generative Replay for Lifelong Reinforcement Learning: Application to Starcraft-2

Learning and reusing primitive behaviours to improve Hindsight Experience Replay sample efficiency

Integrating human learning and reinforcement learning: A novel approach to agent training

Loss of Plasticity in Continual Deep Reinforcement Learning

Cleanba: A Reproducible and Efficient Distributed Reinforcement Learning Platform

Recursive Reinforcement Learning

Recovery RL: Safe Reinforcement Learning with Learned Recovery Zones

Policy Rehearsing: Training Generalizable Policies for Reinforcement Learning

Scaling Population-Based Reinforcement Learning with GPU Accelerated Simulation

From Imitation to Refinement -- Residual RL for Precise Assembly

Decomposing Control Lyapunov Functions for Efficient Reinforcement Learning