Abstract: Recent policy optimization approaches (Schulman et al., 2015a; 2017) have achieved substantial empirical successes by constructing new proxy optimization objectives. These proxy objectives allow stable and low variance policy learning, but require small policy updates to ensure that the proxy objective remains an accurate approximation of the target policy value. In this paper we derive an alternative objective that obtains the value of the target policy by applying importance sampling (IS). However, the basic importance sampled objective is not suitable for policy optimization, as it incurs too high variance in policy updates. We therefore introduce an approximation that allows us to directly trade off the bias of approximation with the variance in policy updates. We show that our approximation unifies previously developed approaches and allows us to interpolate between them. We develop a practical algorithm by optimizing the introduced objective with proximal policy optimization techniques (Schulman et al., 2017). We also provide a theoretical analysis of the introduced policy optimization objective demonstrating bias-variance trade-off. We empirically demonstrate that the resulting algorithm improves upon state of the art on-policy policy optimization on continuous control benchmarks.

What problem does this paper attempt to address?

The problem this paper addresses is the following: in policy optimization, existing methods achieve stable, low-variance policy learning by constructing surrogate optimization objectives, but they require small policy updates so that the surrogate objective remains an accurate approximation of the target policy value. The paper proposes a new objective function based on importance sampling (IS) that directly trades off approximation bias against the variance of policy updates, with the aim of improving policy optimization performance on continuous-control benchmarks.

### Detailed Explanation

#### Background

1. **Limitations of Existing Methods**
   - Existing policy optimization methods (such as TRPO and PPO) achieve stable policy learning by introducing biased but low-variance surrogate objective functions.
   - These methods require small policy updates so that the surrogate objective function remains an accurate approximation of the target policy value.
   - However, restricting updates to small steps limits how far the policy can move per update and may lead to slow convergence or to getting trapped in local optima.
2. **Challenges of Importance Sampling**
   - Importance sampling can provide an unbiased estimate, but its variance may grow exponentially with the number of time steps, which makes it unsuitable for direct optimization.
   - Directly using importance sampling as the optimization objective is therefore not practical.

#### Proposed Method

1. **New Objective Function**
   - The paper proposes a new objective function \( L_{\alpha}^{\pi}(\tilde{\pi}) \) based on importance sampling. This function balances approximation bias and variance through the parameter \(\alpha_t\).
   - Specifically, the objective function is defined as:
     \[ L_{\alpha}^{\pi}(\tilde{\pi}) = \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t \geq 0} \gamma^t \left( \prod_{i = 0}^{t} \left( \frac{\tilde{\pi}(a_i|s_i)}{\pi(a_i|s_i)} \right)^{\alpha_i^t} \right) A_{\pi}(s_t, a_t) \right] \]
   - Here, \(\alpha_t = (\alpha_0^t, \ldots, \alpha_t^t)\) is a vector of length \(t + 1\), and each component \(\alpha_i^t \in [0, 1]\).
2. **Balancing Bias and Variance**
   - When \(\alpha_t = (0, 0, \ldots, 0, 1)\), the objective function reduces to the existing surrogate objective function \(L_{\pi}(\tilde{\pi})\).
   - When \(\alpha_t = (1, 1, \ldots, 1)\), the objective function becomes pure importance sampling.
   - Intermediate values of \(\alpha_t\) interpolate between these two extremes, trading bias against variance.
3. **Theoretical Analysis**
   - The paper provides a theoretical analysis of the bias and variance of the new objective function, showing that both can be controlled by adjusting \(\alpha_t\).
   - For example, Lemma 1 shows that with a sparse \(\alpha_t\) vector, the variance can be kept within a finite range.
4. **Experimental Verification**
   - On the standard MuJoCo continuous-control tasks, the experimental results show that the proposed Approximate IS Policy Optimization method outperforms the existing PPO method and is more robust to the choice of hyperparameters.

### Summary

By introducing a new importance-sampling-based objective function \( L_{\alpha}^{\pi}(\tilde{\pi}) \), the paper addresses the requirement in existing policy optimization methods that policy updates stay small in order to keep the surrogate objective function accurate.
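The interpolation between the surrogate objective and pure importance sampling can be made concrete with a small sketch. The code below is not the paper's implementation. It is a minimal, single-trajectory illustration that assumes the per-step ratios \(\tilde{\pi}(a_i|s_i)/\pi(a_i|s_i)\) and the advantages \(A_{\pi}(s_t, a_t)\) are already available as plain lists. All function names, the numeric values, and the "window" choice of \(\alpha_t\) are hypothetical and introduced only for illustration.

```python
import math

def approx_is_weights(ratios, alphas):
    """Return w_t = prod_{i=0..t} ratios[i] ** alphas[t][i] for every step t."""
    return [
        math.prod(r ** a for r, a in zip(ratios[: t + 1], alphas[t]))
        for t in range(len(ratios))
    ]

def surrogate_alphas(T):
    # alpha_t = (0, ..., 0, 1): only the current step's ratio is kept.
    return [[0.0] * t + [1.0] for t in range(T)]

def full_is_alphas(T):
    # alpha_t = (1, ..., 1): every ratio up to step t is kept.
    return [[1.0] * (t + 1) for t in range(T)]

def window_alphas(T, k):
    # Hypothetical sparse choice (an assumption, not taken from the paper):
    # keep only the ratios of the last k steps, zero out the rest.
    return [[1.0 if i > t - k else 0.0 for i in range(t + 1)] for t in range(T)]

def objective_estimate(ratios, advantages, alphas, gamma=0.99):
    """Single-trajectory estimate of sum_t gamma^t * w_t * A_t."""
    weights = approx_is_weights(ratios, alphas)
    return sum(gamma ** t * w * adv
               for t, (w, adv) in enumerate(zip(weights, advantages)))

# Made-up example values for one trajectory of length 4.
ratios = [1.1, 0.9, 1.2, 0.8]
advantages = [0.5, -0.2, 0.3, 0.1]

w_surrogate = approx_is_weights(ratios, surrogate_alphas(4))  # per-step ratios
w_full = approx_is_weights(ratios, full_is_alphas(4))         # cumulative products
w_window = approx_is_weights(ratios, window_alphas(4, 2))     # in-between choice
```

With `surrogate_alphas` each weight equals the ratio of the current step alone, and with `full_is_alphas` each weight is the cumulative product of all ratios so far, which is where the rapidly growing variance comes from. The window variant sits between the two: `k = 1` matches the surrogate case and `k >= T` matches full importance sampling.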

Policy Optimization Through Approximate Importance Sampling
