Abstract: Recent advances in learning decision-making policies can largely be attributed to training expressive policy models, mostly via imitation learning. While imitation learning discards non-expert data, reinforcement learning (RL) can still learn from suboptimal data. However, instantiating RL training of a new policy class often presents a different challenge: most deep RL machinery is co-developed with assumptions on the policy class and backbone, resulting in poor performance when the policy class changes. For instance, SAC utilizes a low-variance reparameterization policy gradient for Gaussian policies, but this is unstable for diffusion policies and intractable for autoregressive categorical policies. To address this issue, we develop an offline RL and online fine-tuning approach called policy-agnostic RL (PA-RL) that can effectively train multiple policy classes with varying architectures and sizes. We build on the basic idea that a universal supervised learning loss can replace the policy improvement step in RL, as long as it is applied to "optimized" actions. To obtain these optimized actions, we first sample multiple actions from a base policy, then run global optimization (i.e., re-ranking multiple action samples using the Q-function) and local optimization (i.e., running gradient steps on an action sample) to maximize the critic on these candidates. PA-RL enables fine-tuning diffusion and transformer policies with either autoregressive tokens or continuous action outputs, at different sizes, entirely via actor-critic RL. Moreover, PA-RL improves performance and sample efficiency by up to 2 times compared to existing offline RL and online fine-tuning methods. We show the first result that successfully fine-tunes OpenVLA, a 7B generalist robot policy, autonomously with Cal-QL, an online RL fine-tuning algorithm, improving from 40% to 70% in the real world in 40 minutes.
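The action-optimization step described in the abstract (sample candidates, re-rank them with the Q-function, then take gradient steps on the action) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the toy critic `q_value`, its analytic gradient `q_grad_action`, the sampler `sample_actions`, and all hyperparameters (`num_samples`, `top_k`, `grad_steps`, `step_size`) are hypothetical stand-ins chosen so the example runs with NumPy alone.

```python
import numpy as np

def q_value(state, action):
    # Hypothetical critic (assumption): highest when action == tanh(state).
    return -np.sum((action - np.tanh(state)) ** 2, axis=-1)

def q_grad_action(state, action):
    # Analytic gradient of the toy critic with respect to the action.
    return -2.0 * (action - np.tanh(state))

def optimize_actions(state, sample_actions, num_samples=16, top_k=4,
                     grad_steps=5, step_size=0.1):
    # 1) Sample several candidate actions from the base policy.
    candidates = sample_actions(state, num_samples)
    # 2) Global optimization: re-rank candidates by Q and keep the best top_k.
    scores = q_value(state, candidates)
    best = candidates[np.argsort(scores)[-top_k:]]
    # 3) Local optimization: gradient ascent on the critic w.r.t. the action.
    for _ in range(grad_steps):
        best = best + step_size * q_grad_action(state, best)
    return best

# Stand-in for a base policy (assumption): standard Gaussian samples.
rng = np.random.default_rng(0)

def sample_actions(state, n):
    return rng.normal(0.0, 1.0, size=(n, state.shape[-1]))

state = np.array([0.5, -1.0])
optimized = optimize_actions(state, sample_actions)
```

In the approach the abstract outlines, the resulting `optimized` actions would then serve as targets for a supervised learning loss on the policy, which is what would let the same procedure apply across policy classes; that training step is omitted here.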

Decision Mamba: A Multi-Grained State Space Model with Self-Evolution Regularization for Offline RL

Mamba as Decision Maker: Exploring Multi-scale Sequence Modeling in Offline Reinforcement Learning

Decision Mamba: Reinforcement Learning via Hybrid Selective Sequence Modeling

Integrating Multi-Modal Input Token Mixer Into Mamba-Based Decision Models: Decision MetaMamba

KalMamba: Towards Efficient Probabilistic State Space Models for RL under Uncertainty

Is Mamba Compatible with Trajectory Optimization in Offline Reinforcement Learning?

Drama: Mamba-Enabled Model-Based Reinforcement Learning Is Sample and Parameter Efficient

Q-value Regularized Decision ConvFormer for Offline Reinforcement Learning

Mutual Information Regularized Offline Reinforcement Learning

Decision Mamba: Reinforcement Learning via Sequence Modeling with Selective State Spaces

Planning, Fast and Slow: Online Reinforcement Learning with Action-Free Offline Data Via Multiscale Planners

Goal-conditioned Offline Reinforcement Learning through State Space Partitioning

Policy-regularized Offline Multi-objective Reinforcement Learning

Bayes Adaptive Monte Carlo Tree Search for Offline Model-based Reinforcement Learning

Decision S4: Efficient Sequence-Based RL via State Spaces Layers

Decision Mamba Architectures

Policy Agnostic RL: Offline RL and Online RL Fine-Tuning of Any Class and Backbone

HarmoDT: Harmony Multi-Task Decision Transformer for Offline Reinforcement Learning

MoMA: Model-based Mirror Ascent for Offline Reinforcement Learning