Abstract: In this work, we decouple the iterative bi-level optimization of offline RL (value estimation and policy extraction) from the offline training phase, forming a non-iterative bi-level paradigm that avoids iterative error propagation across the two levels. Specifically, this non-iterative paradigm allows us to conduct inner-level optimization (value estimation) during training, while performing outer-level optimization (policy extraction) during testing. Naturally, such a paradigm raises three core questions that are not fully answered by prior non-iterative offline RL counterparts such as reward-conditioned policies: (q1) What information should we transfer from the inner level to the outer level? (q2) What should we pay attention to when exploiting the transferred information for safe/confident outer-level optimization? (q3) What are the benefits of concurrently conducting outer-level optimization during testing? Motivated by model-based optimization (MBO), we propose DROP (design from policies), which fully answers the above questions. Specifically, at the inner level, DROP decomposes the offline data into multiple subsets and learns an MBO score model (a1). To keep exploitation of the score model safe at the outer level, we explicitly learn a behavior embedding and introduce a conservative regularization (a2). During testing, we show that DROP permits deployment adaptation, enabling adaptive inference across states (a3). Empirically, we evaluate DROP on various tasks, showing that it achieves performance comparable to or better than prior methods.
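The outer-level step described in the abstract (optimizing a behavior embedding at test time against a learned score model) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the function names adapt_embedding and toy_score_grad, the quadratic toy score, and the L2 pull toward the initial embedding, which is only a simple stand-in for the conservative regularization the abstract mentions.

```python
import numpy as np

def toy_score_grad(state, z):
    # Hypothetical score f(s, z) = -0.5 * ||z - s||^2, so df/dz = s - z.
    # A real score model would be learned from the offline data subsets.
    return state - z

def adapt_embedding(score_grad, state, z_init, steps=10, lr=0.1, reg=1.0):
    """Gradient ascent on the embedding z for one state at test time.

    Maximizes f(state, z) - 0.5 * reg * ||z - z_init||^2, where the L2 term
    (an assumption, not the paper's regularizer) keeps z near an embedding
    that is supported by the data.
    """
    z = np.array(z_init, dtype=float)
    for _ in range(steps):
        grad = score_grad(state, z) - reg * (z - z_init)
        z = z + lr * grad
    return z

state = np.array([2.0, -2.0])
z_init = np.zeros(2)  # stands in for the embedding of a behavior in the dataset
z_adapted = adapt_embedding(toy_score_grad, state, z_init, steps=200)
```

In this toy setting, reg=0 lets z move all the way to the score maximizer, while larger reg values hold it closer to z_init. Running the loop separately for each visited state is what makes the inference state-dependent; the adapted embedding would then condition a policy to pick an action.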

Adaptive Spiking TD3+BC for Offline-to-Online Spiking Reinforcement Learning

DROP: Conservative Model-based Optimization for Offline Reinforcement Learning

DARA: Dynamics-Aware Reward Augmentation in Offline Reinforcement Learning

Design from Policies: Conservative Test-Time Adaptation for Offline Policy Optimization

Behavior Proximal Policy Optimization

A Rank-Based Sampling Framework for Offline Reinforcement Learning

Adaptive Policy Learning for Offline-to-Online Reinforcement Learning

A Low Latency Adaptive Coding Spike Framework for Deep Reinforcement Learning

TD3 with Reverse KL Regularizer for Offline Reinforcement Learning from Mixed Datasets

Adaptive Behavior Cloning Regularization for Stable Offline-to-Online Reinforcement Learning

Interpretable performance analysis towards offline reinforcement learning: A dataset perspective

To Switch or Not to Switch? Balanced Policy Switching in Offline Reinforcement Learning

Combined Constraint on Behavior Cloning and Discriminator in Offline Reinforcement Learning

Boosting Offline Reinforcement Learning via Data Rebalancing

Improving and Benchmarking Offline Reinforcement Learning Algorithms

Robust Offline Reinforcement Learning from Low-Quality Data

Efficient Offline Reinforcement Learning: The Critic is Critical

Offline Reinforcement Learning with Behavioral Supervisor Tuning

Efficient Online Reinforcement Learning with Offline Data

Efficient Diffusion Policies for Offline Reinforcement Learning

Deadly triad matters for offline reinforcement learning