Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning

Hanhan Zhou, Tian Lan, Vaneet Aggarwal
DOI: https://doi.org/10.48550/arXiv.2308.14897
2023-08-29
Abstract: Offline reinforcement learning aims to utilize datasets of previously gathered environment-action interaction records to learn a policy without access to the real environment. Recent work has shown that offline reinforcement learning can be formulated as a sequence modeling problem and solved via supervised learning with approaches such as decision transformer. While these sequence-based methods achieve competitive results over return-to-go methods, especially on tasks that require longer episodes or with scarce rewards, importance sampling is not considered to correct the policy bias when dealing with off-policy data, mainly due to the absence of behavior policy and the use of deterministic evaluation policies. To this end, we propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation (DPE) in a unified framework with statistically proven properties on variance reduction. We validate our method in multiple tasks of OpenAI Gym with D4RL benchmarks. Our method brings performance improvements on selected methods and outperforms SOTA baselines in several tasks, demonstrating the advantages of enabling double policy estimation for sequence-modeled reinforcement learning.
Machine Learning; Artificial Intelligence; Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper aims to address the policy bias issue in sequence modeling methods for Offline Reinforcement Learning (Offline RL). Specifically, the authors propose a new algorithm, Double Policy Estimation (DPE), to reduce the variance of Importance Sampling (IS) in sequence modeling for offline reinforcement learning.

### Background and Motivation

1. **Offline Reinforcement Learning**:
   - The goal of offline reinforcement learning is to learn policies using pre-collected datasets without interacting with the real environment.
   - Recent research has shown that offline reinforcement learning can be formulated as a sequence modeling problem and solved using supervised learning methods such as Decision Transformers.
2. **Importance Sampling**:
   - Importance sampling is a method used to correct policy bias, but it faces several challenges when dealing with offline data:
     - The behavior policy is usually unavailable.
     - The evaluation policy is typically deterministic, making it difficult to reweight different experiences or trajectories.
     - In long-horizon problems, the variance of importance sampling tends to be excessively high, leading to uninformative results.

### Proposed Method

1. **Double Policy Estimation (DPE)**:
   - DPE estimates both the behavior policy and the target policy simultaneously to compute the importance sampling estimates.
   - Specifically, DPE introduces maximum likelihood estimation for both the behavior policy and the target policy to calculate the likelihood ratio of state-action pairs in all offline data.
2. **Theoretical Analysis**:
   - The authors provide a theoretical analysis of the DPE estimator, proving that DPE can reduce the variance in learning the target policy.
   - By introducing estimates of both the behavior policy and the target policy, DPE can improve the performance of importance sampling in long-horizon tasks.

### Experimental Validation

1. **Experimental Setup**:
   - Validation was conducted on multiple tasks in OpenAI Gym using the D4RL benchmark dataset.
   - The dataset includes medium, medium-replay, and medium-expert datasets, containing mixed and suboptimal trajectories.
2. **Baseline Methods**:
   - Comparisons were made with various state-of-the-art baseline methods, including Decision Transformer (DT), Reward-conditioned Imitation Learning (RvS), Conservative Q-Learning (CQL), Behavior Cloning (BC), etc.
3. **Experimental Results**:
   - DPE performed excellently across multiple datasets, especially on medium and medium-expert datasets, significantly outperforming existing baseline methods.
   - The experimental results show that DPE achieved better performance in terms of average reward, particularly in long-horizon tasks.

### Conclusion

This paper addresses the high variance issue of importance sampling in sequence modeling methods for offline reinforcement learning by proposing the Double Policy Estimation (DPE) algorithm. Experimental results demonstrate that DPE performs exceptionally well on multiple benchmark tasks, significantly improving the performance of offline reinforcement learning.
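
To make the idea under "Proposed Method" more concrete, here is a minimal sketch of estimating both policies by maximum likelihood and forming importance ratios from the two estimates. It is not the authors' implementation: it assumes a small tabular setting with discrete states and actions, count-based estimates with additive smoothing, and synthetic data. The names `mle_policy` and `per_step_ratios`, the smoothing constant, and the toy policies are all hypothetical. The paper itself targets sequence models such as Decision Transformer, so its estimators likely differ.

```python
import numpy as np


def mle_policy(states, actions, n_states, n_actions, smoothing=1.0):
    """Count-based maximum likelihood estimate of pi(a | s).

    Additive smoothing (an assumption of this sketch) keeps every
    probability strictly positive so that ratios stay finite.
    """
    counts = np.full((n_states, n_actions), smoothing, dtype=float)
    np.add.at(counts, (states, actions), 1.0)  # accumulate (s, a) visit counts
    return counts / counts.sum(axis=1, keepdims=True)


def per_step_ratios(states, actions, pi_target_hat, pi_behavior_hat):
    """Likelihood ratio pi_target_hat(a|s) / pi_behavior_hat(a|s) per sample."""
    return pi_target_hat[states, actions] / pi_behavior_hat[states, actions]


rng = np.random.default_rng(0)
n_states, n_actions, n = 4, 3, 5000

# Hypothetical behavior data: actions chosen uniformly at random.
s_b = rng.integers(n_states, size=n)
a_b = rng.integers(n_actions, size=n)

# Hypothetical target data: a deterministic rule a = s % n_actions,
# followed 90% of the time, with random actions otherwise.
s_t = rng.integers(n_states, size=n)
a_t = np.where(rng.random(n) < 0.9,
               s_t % n_actions,
               rng.integers(n_actions, size=n))

# "Double" estimation: fit both policies from data instead of assuming
# that either one is known.
pi_b_hat = mle_policy(s_b, a_b, n_states, n_actions)
pi_t_hat = mle_policy(s_t, a_t, n_states, n_actions)

# Importance weights for the behavior (offline) samples.
weights = per_step_ratios(s_b, a_b, pi_t_hat, pi_b_hat)
print("mean weight:", weights.mean())
```

For a whole trajectory, the per-step ratios would be multiplied together, which is where the long-horizon variance problem described in the summary comes from. Smoothing is one simple way to keep an estimated, near-deterministic target policy from producing zero or undefined ratios; other choices are possible.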