Abstract: Offline reinforcement learning aims to utilize datasets of previously gathered environment-action interaction records to learn a policy without access to the real environment. Recent work has shown that offline reinforcement learning can be formulated as a sequence modeling problem and solved via supervised learning with approaches such as Decision Transformer. While these sequence-based methods achieve competitive results over return-to-go methods, especially on tasks that require longer episodes or have scarce rewards, they do not use importance sampling to correct the policy bias when dealing with off-policy data, mainly because the behavior policy is absent and the evaluation policies are deterministic. To this end, we propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation (DPE) in a unified framework with statistically proven properties on variance reduction. We validate our method on multiple OpenAI Gym tasks with D4RL benchmarks. Our method brings performance improvements on selected methods and outperforms SOTA baselines in several tasks, demonstrating the advantages of enabling double policy estimation for sequence-modeled reinforcement learning.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper aims to address the policy bias issue in sequence modeling methods for Offline Reinforcement Learning (Offline RL). Specifically, the authors propose a new algorithm, Double Policy Estimation (DPE), to reduce the variance of Importance Sampling (IS) in sequence modeling for offline reinforcement learning.

### Background and Motivation

1. **Offline Reinforcement Learning**:
   - The goal of offline reinforcement learning is to learn policies using pre-collected datasets without interacting with the real environment.
   - Recent research has shown that offline reinforcement learning can be formulated as a sequence modeling problem and solved using supervised learning methods such as Decision Transformers.
2. **Importance Sampling**:
   - Importance sampling is a method used to correct policy bias, but it faces several challenges when dealing with offline data:
     - The behavior policy is usually unavailable.
     - The evaluation policy is typically deterministic, making it difficult to reweight different experiences or trajectories.
     - In long-horizon problems, the variance of importance sampling tends to be excessively high, leading to uninformative results.

### Proposed Method

1. **Double Policy Estimation (DPE)**:
   - DPE estimates both the behavior policy and the target policy simultaneously to compute the importance sampling estimates.
   - Specifically, DPE introduces maximum likelihood estimation for both the behavior policy and the target policy to calculate the likelihood ratio of state-action pairs in all offline data.
2. **Theoretical Analysis**:
   - The authors provide a theoretical analysis of the DPE estimator, proving that DPE can reduce the variance in learning the target policy.
   - By introducing estimates of both the behavior policy and the target policy, DPE can improve the performance of importance sampling in long-horizon tasks.

### Experimental Validation

1. **Experimental Setup**:
   - Validation was conducted on multiple tasks in OpenAI Gym using the D4RL benchmark.
   - The benchmark includes medium, medium-replay, and medium-expert datasets, containing mixed and suboptimal trajectories.
2. **Baseline Methods**:
   - Comparisons were made with various state-of-the-art baseline methods, including Decision Transformer (DT), Reward-conditioned Imitation Learning (RvS), Conservative Q-Learning (CQL), Behavior Cloning (BC), etc.
3. **Experimental Results**:
   - DPE performed excellently across multiple datasets, especially on medium and medium-expert datasets, significantly outperforming existing baseline methods.
   - The experimental results show that DPE achieved better performance in terms of average reward, particularly in long-horizon tasks.

### Conclusion

This paper addresses the high variance issue of importance sampling in sequence modeling methods for offline reinforcement learning by proposing the Double Policy Estimation (DPE) algorithm. Experimental results demonstrate that DPE performs exceptionally well on multiple benchmark tasks, significantly improving the performance of offline reinforcement learning.
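The likelihood-ratio idea described under Proposed Method can be illustrated with a small tabular sketch. This is not the authors' implementation: it assumes discrete states and actions, uses count-based estimates with additive smoothing in place of whatever estimator the paper fits, and the names `mle_policy` and `likelihood_ratios` are hypothetical. It only shows the general shape of estimating both policies from data and forming the ratio of the estimated target policy to the estimated behavior policy for each logged state-action pair.

```python
import numpy as np


def mle_policy(states, actions, n_states, n_actions, smoothing=1.0):
    """Count-based policy estimate pi(a | s) from (state, action) samples.

    With smoothing=0 this is the plain maximum-likelihood estimate; a positive
    value (an assumption of this sketch) keeps every probability non-zero.
    """
    states = np.asarray(states, dtype=int)
    actions = np.asarray(actions, dtype=int)
    counts = np.full((n_states, n_actions), smoothing, dtype=float)
    # Accumulate one count per observed (state, action) pair.
    np.add.at(counts, (states, actions), 1.0)
    return counts / counts.sum(axis=1, keepdims=True)


def likelihood_ratios(states, behavior_actions, target_actions,
                      n_states, n_actions, smoothing=1.0):
    """Ratio pi_e_hat(a | s) / pi_b_hat(a | s) for each logged pair.

    behavior_actions: actions recorded in the offline dataset.
    target_actions:   actions a (deterministic) target policy would pick
                      in the same states.
    """
    pi_b = mle_policy(states, behavior_actions, n_states, n_actions, smoothing)
    pi_e = mle_policy(states, target_actions, n_states, n_actions, smoothing)
    states = np.asarray(states, dtype=int)
    logged = np.asarray(behavior_actions, dtype=int)
    return pi_e[states, logged] / pi_b[states, logged]


# Toy data: two states, two actions (purely illustrative).
states = [0, 0, 0, 1, 1]
behavior_actions = [0, 1, 1, 0, 0]   # what the logged data contains
target_actions = [1, 1, 1, 0, 0]     # what the target policy would do

ratios = likelihood_ratios(states, behavior_actions, target_actions,
                           n_states=2, n_actions=2)
print(ratios)
```

In this toy setup, logged pairs where the dataset action agrees with the target policy receive a ratio of at least 1, while the disagreeing pair is down-weighted. Estimating the target policy, rather than treating it as a point mass, is what keeps the ratio finite and non-zero for every logged pair in this sketch.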

Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning

Near-Optimal Offline Reinforcement Learning via Double Variance Reduction

Doubly Optimal Policy Evaluation for Reinforcement Learning

Efficient Policy Evaluation with Offline Data Informed Behavior Policy Design

Off-Policy Evaluation in Doubly Inhomogeneous Environments

Double Actors and Uncertainty-Weighted Critics for Offline Reinforcement Learning

More Efficient Off-Policy Evaluation through Regularized Targeted Learning

DCE: Offline Reinforcement Learning with Double Conservative Estimates

Towards Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling

A2PO: Towards Effective Offline Reinforcement Learning from an Advantage-aware Perspective

Efficient Policy Evaluation with Safety Constraint for Reinforcement Learning

SimuDICE: Offline Policy Optimization Through World Model Updates and DICE Estimation

Offline Reinforcement Learning via High-Fidelity Generative Behavior Modeling

OPERA: Automatic Offline Policy Evaluation with Re-weighted Aggregates of Multiple Estimators

Variance-Reduced Off-Policy Memory-Efficient Policy Search

SDV: Simple Double Validation Model-based Offline Reinforcement Learning

Off-Policy Evaluation via Off-Policy Classification

Policy Agnostic RL: Offline RL and Online RL Fine-Tuning of Any Class and Backbone

Efficient Multi-Policy Evaluation for Reinforcement Learning

Primal-Dual Spectral Representation for Off-policy Evaluation