Abstract: Recent advances in learning decision-making policies can largely be attributed to training expressive policy models, mostly via imitation learning. While imitation learning discards non-expert data, reinforcement learning (RL) can still learn from suboptimal data. However, instantiating RL training for a new policy class often presents a different challenge: most deep RL machinery is co-developed with assumptions about the policy class and backbone, resulting in poor performance when the policy class changes. For instance, SAC uses a low-variance reparameterization policy gradient for Gaussian policies, but this estimator is unstable for diffusion policies and intractable for autoregressive categorical policies. To address this issue, we develop an offline RL and online fine-tuning approach called policy-agnostic RL (PA-RL) that can effectively train multiple policy classes with varying architectures and sizes. We build on the basic idea that a universal supervised learning loss can replace the policy improvement step in RL, as long as it is applied to "optimized" actions. To obtain these optimized actions, we first sample multiple actions from a base policy, then run global optimization (i.e., re-ranking the action samples using the Q-function) and local optimization (i.e., running gradient steps on an action sample) to maximize the critic on these candidates. PA-RL enables fine-tuning diffusion and transformer policies with either autoregressive tokens or continuous action outputs, at different sizes, entirely via actor-critic RL. Moreover, PA-RL improves performance and sample efficiency by up to 2 times compared to existing offline RL and online fine-tuning methods. We show the first result that successfully fine-tunes OpenVLA, a 7B generalist robot policy, autonomously with Cal-QL, an online RL fine-tuning algorithm, improving from 40% to 70% in the real world in 40 minutes.
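
The action-optimization step described in the abstract (sample candidates, re-rank them with the Q-function, then take gradient steps on the action) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy quadratic critic `q_value`, its analytic gradient `q_grad_action`, the helper `optimize_actions`, and the parameters `top_k`, `grad_steps`, and `step_size` are all hypothetical names chosen for illustration, and random Gaussian samples stand in for draws from a base policy.

```python
import numpy as np

def q_value(state, action):
    # Hypothetical toy critic (an assumption): highest when action == state.
    return -np.sum((action - state) ** 2, axis=-1)

def q_grad_action(state, action):
    # Analytic gradient of the toy critic with respect to the action.
    return -2.0 * (action - state)

def optimize_actions(state, candidates, top_k=2, grad_steps=10, step_size=0.1):
    # Global optimization: re-rank candidates by Q-value and keep the top_k.
    scores = q_value(state, candidates)
    top_idx = np.argsort(scores)[::-1][:top_k]
    best = candidates[top_idx].copy()
    # Local optimization: gradient ascent on the critic w.r.t. each action.
    for _ in range(grad_steps):
        best = best + step_size * q_grad_action(state, best)
    return best

rng = np.random.default_rng(0)
state = np.array([0.5, -0.2])
# Stand-in for several action samples drawn from a base policy.
candidates = rng.normal(size=(8, 2))
optimized = optimize_actions(state, candidates)
```

In this sketch, the rows of `optimized` would then serve as regression or likelihood targets for a supervised loss on the policy, which is the part that makes the scheme independent of the policy class. A real critic would be a learned network, with the action gradient obtained by automatic differentiation rather than a closed form.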

General policy mapping: online continual reinforcement learning inspired on the insect brain

Learning Off-policy with Model-based Intrinsic Motivation For Active Online Exploration

Lifelong Incremental Reinforcement Learning with Online Bayesian Inference

Adaptive Policy Learning for Offline-to-Online Reinforcement Learning

Discovering neural policies to drive behaviour by integrating deep reinforcement learning agents with biological neural networks

LoopSR: Looping Sim-and-Real for Lifelong Policy Adaptation of Legged Robots

Policy Agnostic RL: Offline RL and Online RL Fine-Tuning of Any Class and Backbone

The RL Perceptron: Generalisation Dynamics of Policy Learning in High Dimensions

Curriculum Goal-Conditioned Imitation for Offline Reinforcement Learning

Deep RL With Information Constrained Policies: Generalization in Continuous Control

Integrating human learning and reinforcement learning: A novel approach to agent training

Reinforcement Learning as a Robotics-Inspired Framework for Insect Navigation: From Spatial Representations to Neural Implementation

Environment as Policy: Learning to Race in Unseen Tracks

MOORe: Model-based Offline-to-Online Reinforcement Learning

Lifelong Reinforcement Learning with Modulating Masks

Using Offline Data to Speed-up Reinforcement Learning in Procedurally Generated Environments

Curriculum Reinforcement Learning via Morphology-Environment Co-Evolution

Reinforcement Learning with Brain-Inspired Modulation can Improve Adaptation to Environmental Changes

In-context Exploration-Exploitation for Reinforcement Learning

Lifelong Reinforcement Learning via Neuromodulation