Abstract: Recent advances in learning decision-making policies can largely be attributed to training expressive policy models, mostly via imitation learning. While imitation learning discards non-expert data, reinforcement learning (RL) can still learn from suboptimal data. However, instantiating RL training of a new policy class often presents a different challenge: most deep RL machinery is co-developed with assumptions about the policy class and backbone, resulting in poor performance when the policy class changes. For instance, SAC utilizes a low-variance reparameterization policy gradient for Gaussian policies, but this is unstable for diffusion policies and intractable for autoregressive categorical policies. To address this issue, we develop an offline RL and online fine-tuning approach called policy-agnostic RL (PA-RL) that can effectively train multiple policy classes with varying architectures and sizes. We build on the basic idea that a universal supervised learning loss can replace the policy improvement step in RL, as long as it is applied to "optimized" actions. To obtain these optimized actions, we first sample multiple actions from a base policy, then run global optimization (i.e., re-ranking multiple action samples using the Q-function) and local optimization (i.e., running gradient steps on an action sample) to maximize the critic on these candidates. PA-RL enables fine-tuning diffusion and transformer policies, with either autoregressive tokens or continuous action outputs and at different sizes, entirely via actor-critic RL. Moreover, PA-RL improves performance and sample efficiency by up to 2 times compared to existing offline RL and online fine-tuning methods. We show the first result that successfully fine-tunes OpenVLA, a 7B generalist robot policy, autonomously with Cal-QL, an online RL fine-tuning algorithm, improving from 40% to 70% in the real world in 40 minutes.
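The action-optimization step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the quadratic critic `q_value`, its analytic gradient `q_grad`, the function name `optimize_actions`, and all hyperparameters (`top_k`, `grad_steps`, `step_size`) are assumptions chosen only to make the example self-contained, and the Gaussian samples stand in for actions drawn from a base policy.

```python
import numpy as np

def q_value(state, actions):
    """Toy critic (an assumption): Q(s, a) = -||a - s||^2, maximized at a = s."""
    return -np.sum((actions - state) ** 2, axis=-1)

def q_grad(state, actions):
    """Analytic gradient of the toy critic with respect to the action."""
    return -2.0 * (actions - state)

def optimize_actions(state, candidates, top_k=4, grad_steps=10, step_size=0.1):
    """Return 'optimized' actions from a set of sampled candidates."""
    # Global optimization: re-rank the candidates by Q-value, keep the best top_k.
    scores = q_value(state, candidates)
    keep = np.argsort(scores)[::-1][:top_k]
    actions = candidates[keep].copy()
    # Local optimization: a few gradient-ascent steps on Q with respect to the action.
    for _ in range(grad_steps):
        actions = actions + step_size * q_grad(state, actions)
    return actions

rng = np.random.default_rng(0)
state = np.array([0.5, -0.2])
# Stand-in for sampling multiple actions from a base policy (hypothetical).
candidates = rng.normal(size=(32, 2))
targets = optimize_actions(state, candidates)
```

In this sketch, `targets` would then serve as the labels for whatever supervised loss the policy class already supports (for example, a denoising loss for a diffusion policy or a cross-entropy loss over action tokens), which is what lets the policy improvement step avoid assumptions about the policy class.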

Off-Policy Deep Reinforcement Learning with Analogous Disentangled Exploration

Non-local Policy Optimization via Diversity-regularized Collaborative Exploration

Efficient Reinforcement Learning via Decoupling Exploration and Utilization

Orthogonal Adversarial Deep Reinforcement Learning for Discrete- and Continuous-Action Problems

Never Give Up: Learning Directed Exploration Strategies

PP-PG: Combining Parameter Perturbation with Policy Gradient Methods for Effective and Efficient Explorations in Deep Reinforcement Learning

Learning Off-policy with Model-based Intrinsic Motivation For Active Online Exploration

Episodic Reinforcement Learning with Expanded State-reward Space

Random Latent Exploration for Deep Reinforcement Learning

Off-Policy Deep Reinforcement Learning Based on Steffensen Value Iteration

Decoupled Reinforcement Learning to Stabilise Intrinsically-Motivated Exploration

A Scalable Derivative-free Exploration Approach for Reinforcement Learning

Adaptive trajectory-constrained exploration strategy for deep reinforcement learning

ExploRLLM: Guiding Exploration in Reinforcement Learning with Large Language Models

RD2: Reward Decomposition with Representation Disentanglement

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

Off-Policy Actor-Critic in an Ensemble: Achieving Maximum General Entropy and Effective Environment Exploration in Deep Reinforcement Learning

Policy Agnostic RL: Offline RL and Online RL Fine-Tuning of Any Class and Backbone

Efficiently Training On-Policy Actor-Critic Networks in Robotic Deep Reinforcement Learning with Demonstration-like Sampled Exploration

DEIR: Efficient and Robust Exploration through Discriminative-Model-Based Episodic Intrinsic Rewards

Exploration in Deep Reinforcement Learning: From Single-Agent to Multiagent Domain