Abstract: Recent advances in learning decision-making policies can largely be attributed to training expressive policy models, primarily via imitation learning. While imitation learning discards non-expert data, reinforcement learning (RL) can still learn from suboptimal data. However, instantiating RL training for a new policy class often presents a different challenge: most deep RL machinery is co-developed with assumptions about the policy class and backbone, resulting in poor performance when the policy class changes. For instance, SAC uses a low-variance reparameterization policy gradient for Gaussian policies, but this is unstable for diffusion policies and intractable for autoregressive categorical policies. To address this issue, we develop an offline RL and online fine-tuning approach called policy-agnostic RL (PA-RL) that can effectively train multiple policy classes with varying architectures and sizes. We build on the basic idea that a universal supervised learning loss can replace the policy improvement step in RL, as long as it is applied to "optimized" actions. To obtain these optimized actions, we first sample multiple actions from a base policy, then run global optimization (i.e., re-ranking multiple action samples using the Q-function) and local optimization (i.e., running gradient steps on an action sample) to maximize the critic on these candidates. PA-RL enables fine-tuning diffusion and transformer policies with either autoregressive tokens or continuous action outputs, at different sizes, entirely via actor-critic RL. Moreover, PA-RL improves performance and sample efficiency by up to 2 times compared to existing offline RL and online fine-tuning methods. We show the first result that successfully fine-tunes OpenVLA, a 7B generalist robot policy, autonomously with Cal-QL, an online RL fine-tuning algorithm, improving from 40% to 70% in the real world in 40 minutes.
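The action-optimization step described in the abstract (sample candidates, re-rank them with the critic, then refine them with gradient steps) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the quadratic critic `q_value`, its gradient `q_grad`, the function `optimize_actions`, and all hyperparameters (`top_k`, `grad_steps`, `step_size`) are hypothetical stand-ins chosen only to make the example runnable.

```python
import numpy as np

def q_value(state, actions):
    """Toy critic (an assumption): Q(s, a) = -||a - s||^2, maximized at a = s."""
    return -np.sum((actions - state) ** 2, axis=-1)

def q_grad(state, actions):
    """Analytic gradient of the toy critic with respect to the action."""
    return -2.0 * (actions - state)

def optimize_actions(state, candidates, top_k=4, grad_steps=10, step_size=0.1):
    # Global optimization: re-rank sampled candidates by Q-value, keep the top_k.
    scores = q_value(state, candidates)
    best = candidates[np.argsort(scores)[::-1][:top_k]]
    # Local optimization: a few gradient-ascent steps on Q for each kept action.
    for _ in range(grad_steps):
        best = best + step_size * q_grad(state, best)
    return best

rng = np.random.default_rng(0)
state = np.array([0.5, -0.2])
# Stand-in for samples drawn from a base policy (here: a unit Gaussian).
candidates = rng.normal(size=(32, 2))
optimized = optimize_actions(state, candidates)

# Supervised policy-improvement stand-in: for a constant predictor, the
# mean of the optimized actions minimizes the mean-squared-error loss.
policy_target = optimized.mean(axis=0)
```

In this sketch the policy class never appears inside the optimization itself; only its samples do, which is the property the abstract relies on to stay policy-agnostic. A real policy (diffusion, autoregressive, or otherwise) would then be trained with its own supervised loss on `optimized`.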

Off-Policy Training for Truncated TD(λ) Boosted Soft Actor-Critic.

A Unified Approach for Multi-step Temporal-Difference Learning with Eligibility Traces in Reinforcement Learning

DSAC-T: Distributional Soft Actor-Critic with Three Refinements

Mitigating Estimation Errors by Twin TD-Regularized Actor and Critic for Deep Reinforcement Learning

Modified Retrace for Off-Policy Temporal Difference Learning.

TBQ(σ): Improving Efficiency of Trace Utilization for Off-Policy Reinforcement Learning.

Revisiting a Design Choice in Gradient Temporal Difference Learning

Simplifying Deep Temporal Difference Learning

Investigating practical linear temporal difference learning

An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning

The Ladder in Chaos: A Simple and Effective Improvement to General DRL Algorithms by Policy Path Trimming and Boosting

Optimizing TD3 for 7-DOF Robotic Arm Grasping: Overcoming Suboptimality with Exploration-Enhanced Contrastive Learning

OPAC: Opportunistic Actor-Critic

Context-aware Active Multi-Step Reinforcement Learning

META-Learning Eligibility Traces for More Sample Efficient Temporal Difference Learning

Double Actor-Critic with TD Error-Driven Regularization in Reinforcement Learning

Policy Agnostic RL: Offline RL and Online RL Fine-Tuning of Any Class and Backbone

Multi-State TD Target for Model-Free Reinforcement Learning