Abstract: Most offline reinforcement learning (RL) methods face a trade-off between improving the policy to surpass the behavior policy and constraining it to limit deviation from the behavior policy, because computing Q-values with out-of-distribution (OOD) actions incurs errors due to distributional shift. The recently proposed in-sample learning paradigm (i.e., IQL), which improves the policy by quantile regression using only data samples, shows great promise because it learns an optimal policy without querying the value function of any unseen actions. However, it remains unclear how this type of method handles distributional shift when learning the value function. In this work, we make the key finding that the in-sample learning paradigm arises under the Implicit Value Regularization (IVR) framework. This gives a deeper understanding of why the in-sample learning paradigm works: it applies implicit value regularization to the policy. Based on the IVR framework, we further propose two practical algorithms, Sparse Q-learning (SQL) and Exponential Q-learning (EQL), which adopt the same value regularization used in existing works, but in a completely in-sample manner. Compared with IQL, we find that our algorithms introduce sparsity in learning the value function, making them more robust in noisy data regimes. We also verify the effectiveness of SQL and EQL on D4RL benchmark datasets and show the benefits of in-sample learning by comparing them with CQL in small data regimes.
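
To make the in-sample idea in the abstract concrete, the following is a minimal toy sketch, not the SQL or EQL objective proposed in the paper. It assumes a single state with a handful of hypothetical logged Q-values, and fits a scalar value estimate by quantile (pinball) regression over those logged values only. The function names, the grid search, and the numbers are illustrative assumptions.

```python
import numpy as np

def pinball_loss(v, q_samples, tau):
    """Quantile (pinball) loss of a scalar value estimate v
    against Q-values of actions that appear in the dataset."""
    diff = q_samples - v
    # Under-estimates (diff >= 0) are weighted by tau,
    # over-estimates by (1 - tau).
    return np.mean(np.where(diff >= 0, tau * diff, (tau - 1.0) * diff))

def fit_in_sample_value(q_samples, tau, grid_size=1001):
    """Grid-search the v that minimises the pinball loss.
    The grid spans only the logged Q-values, so no unseen
    action is ever evaluated."""
    grid = np.linspace(q_samples.min(), q_samples.max(), grid_size)
    losses = np.array([pinball_loss(v, q_samples, tau) for v in grid])
    return grid[np.argmin(losses)]

# Hypothetical Q-values of the actions logged for one state.
q_in_sample = np.array([0.0, 1.0, 2.0, 3.0, 4.0])

v_median = fit_in_sample_value(q_in_sample, tau=0.5)
v_high = fit_in_sample_value(q_in_sample, tau=0.9)
print(v_median, v_high)
```

Raising `tau` moves the fitted value toward the largest logged Q-value, which approximates a maximum over dataset actions without querying the value of any action outside the data.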

Judgmentally adjusted Q-values based on Q-ensemble for offline reinforcement learning

A Rank-Based Sampling Framework for Offline Reinforcement Learning

Q-Distribution guided Q-learning for offline reinforcement learning: Uncertainty penalized Q-value via consistency model

A Perspective of Q-value Estimation on Offline-to-Online Reinforcement Learning

Mildly Conservative Q-Learning for Offline Reinforcement Learning

Adaptable Conservative Q-Learning for Offline Reinforcement Learning

Improving Offline-to-Online Reinforcement Learning with Q Conditioned State Entropy Exploration

Boosting Offline Reinforcement Learning via Data Rebalancing

Offline Reinforcement Learning with Imbalanced Datasets

Interpretable performance analysis towards offline reinforcement learning: A dataset perspective

Adaptive pessimism via target Q-value for offline reinforcement learning

Constrained Policy Optimization with Explicit Behavior Density for Offline Reinforcement Learning

Offline RL with No OOD Actions: In-Sample Learning Via Implicit Value Regularization

ENOTO: Improving Offline-to-Online Reinforcement Learning with Q-Ensembles

Entropy-regularized Diffusion Policy with Q-Ensembles for Offline Reinforcement Learning

Constraints Penalized Q-learning for Safe Offline Reinforcement Learning

Offline Reinforcement Learning with On-Policy Q-Function Regularization

Offline Reinforcement Learning for Wireless Network Optimization with Mixture Datasets

Believe What You See: Implicit Constraint Approach for Offline Multi-Agent Reinforcement Learning

Exclusively Penalized Q-learning for Offline Reinforcement Learning