Policy Optimization Through Approximate Importance Sampling

Marcin B. Tomczak, Dongho Kim, Peter Vrancx, Kee-Eung Kim
DOI: https://doi.org/10.48550/arXiv.1910.03857
2020-02-22
Abstract: Recent policy optimization approaches (Schulman et al., 2015a; 2017) have achieved substantial empirical successes by constructing new proxy optimization objectives. These proxy objectives allow stable and low variance policy learning, but require small policy updates to ensure that the proxy objective remains an accurate approximation of the target policy value. In this paper we derive an alternative objective that obtains the value of the target policy by applying importance sampling (IS). However, the basic importance sampled objective is not suitable for policy optimization, as it incurs too high variance in policy updates. We therefore introduce an approximation that allows us to directly trade-off the bias of approximation with the variance in policy updates. We show that our approximation unifies previously developed approaches and allows us to interpolate between them. We develop a practical algorithm by optimizing the introduced objective with proximal policy optimization techniques (Schulman et al., 2017). We also provide a theoretical analysis of the introduced policy optimization objective demonstrating bias-variance trade-off. We empirically demonstrate that the resulting algorithm improves upon state of the art on-policy policy optimization on continuous control benchmarks.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The problem this paper addresses: in policy optimization, existing methods achieve stable, low-variance policy learning by constructing surrogate optimization objectives, but they require small policy updates so that the surrogate objective remains an accurate approximation of the target policy value. The paper proposes a new objective function based on importance sampling (IS) that directly trades off the approximation bias against the variance of policy updates, with the aim of improving policy optimization performance on continuous-control benchmarks.

### Detailed Explanation

#### Background

1. **Limitations of Existing Methods**
   - Existing policy optimization methods (such as TRPO and PPO) achieve stable policy learning by introducing biased but low-variance surrogate objective functions.
   - These methods require small policy updates so that the surrogate objective function remains an accurate approximation of the target policy value.
   - Restricting updates in this way limits how far the policy can move in each step and may lead to slow convergence or to getting trapped in local optima.

2. **Challenges of Importance Sampling**
   - Importance sampling provides an unbiased estimate of the target policy value, but its variance may grow exponentially with the number of time steps, because each term multiplies together the probability ratios of all preceding steps.
   - Using the plain importance-sampled objective directly for optimization is therefore not practical.

#### Proposed Method

1. **New Objective Function**
   - The paper proposes a new objective function \( L_{\alpha}^{\pi}(\tilde{\pi}) \) based on importance sampling. This function balances the approximation bias and variance by adjusting the parameter \(\alpha_t\).
   - Specifically, the objective function is defined as:
     \[ L_{\alpha}^{\pi}(\tilde{\pi}) = \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t \geq 0} \gamma^t \left( \prod_{i = 0}^{t} \left( \frac{\tilde{\pi}(a_i|s_i)}{\pi(a_i|s_i)} \right)^{\alpha_i^t} \right) A_{\pi}(s_t, a_t) \right] \]
   - Here, \(\alpha_t = (\alpha_0^t, \ldots, \alpha_t^t)\) is a vector of length \(t + 1\), and each component \(\alpha_i^t \in [0, 1]\).

2. **Balancing Bias and Variance**
   - When \(\alpha_t = (0, 0, \ldots, 0, 1)\), only the probability ratio at step \(t\) is kept, and the objective function reduces to the existing surrogate objective function \(L_{\pi}(\tilde{\pi})\).
   - When \(\alpha_t = (1, 1, \ldots, 1)\), the objective function becomes pure importance sampling.
   - Intermediate values of \(\alpha_t\) interpolate between these two extremes, trading bias against variance.

3. **Theoretical Analysis**
   - The paper provides a theoretical analysis of the bias and variance of the new objective function, showing that both can be controlled by adjusting \(\alpha_t\).
   - For example, Lemma 1 shows that when a sparse \(\alpha_t\) vector is used, the variance remains bounded.

4. **Experimental Verification**
   - On the standard MuJoCo continuous-control tasks, the experimental results show that the proposed Approximate IS Policy Optimization method outperforms the existing PPO method and is more robust to the choice of hyperparameters.

### Summary

This paper addresses the requirement in existing policy optimization methods that policy updates be small in order to keep the surrogate objective function accurate. It does so by introducing a new objective function \( L_{\alpha}^{\pi}(\tilde{\pi}) \) based on importance sampling.
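
The objective defined above can be sketched numerically for a single sampled trajectory. The following is a minimal illustration, not the authors' implementation. It assumes the exponents are indexed from step 0, so that \(\alpha_t\) has \(t + 1\) components, and stores them in a lower-triangular matrix `alpha[t, i]`. All function names are hypothetical, and `windowed_alpha` is only one illustrative way to pick intermediate exponents; the summary does not say which scheme the paper uses.

```python
import numpy as np


def approx_is_objective(log_ratio, advantages, alpha, gamma=0.99):
    """Single-trajectory estimate of L_alpha (illustrative sketch).

    log_ratio[i]  = log pi_new(a_i|s_i) - log pi_old(a_i|s_i)
    advantages[t] = estimate of A_pi(s_t, a_t)
    alpha[t, i]   = exponent alpha_i^t in [0, 1]; entries with i > t are ignored
    """
    log_ratio = np.asarray(log_ratio, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    T = log_ratio.shape[0]
    alpha = np.tril(np.asarray(alpha, dtype=float))  # keep only i <= t
    # Row t gives log of prod_{i<=t} ratio_i ** alpha[t, i].
    log_weights = alpha @ log_ratio
    discounts = gamma ** np.arange(T)
    return float(np.sum(discounts * np.exp(log_weights) * advantages))


def surrogate_alpha(T):
    """alpha_t = (0, ..., 0, 1): only the ratio at step t is used."""
    return np.eye(T)


def full_is_alpha(T):
    """alpha_t = (1, ..., 1): full per-step importance sampling."""
    return np.tril(np.ones((T, T)))


def windowed_alpha(T, k):
    """Hypothetical intermediate choice: use the ratios of the last k steps."""
    ones = np.ones((T, T))
    return np.tril(ones) - np.tril(ones, -k)


# Example with made-up numbers: three steps of probability ratios and advantages.
log_ratio = np.log([1.1, 0.9, 1.2])
advantages = [1.0, -0.5, 2.0]
for name, alpha in [("surrogate", surrogate_alpha(3)),
                    ("window k=2", windowed_alpha(3, 2)),
                    ("full IS", full_is_alpha(3))]:
    print(name, approx_is_objective(log_ratio, advantages, alpha, gamma=0.5))
```

Working in log space turns each product of ratios into a single matrix-vector product and avoids overflow on long trajectories. When the two policies coincide, every log-ratio is zero and all choices of `alpha` return the same discounted sum of advantages.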