Abstract: A policy represented by a deep neural network can overfit to spurious features in observations, which hinders a reinforcement learning (RL) agent from learning an effective policy. This issue becomes severe in high-dimensional state spaces, where the agent struggles to learn a useful policy. Data augmentation can boost the performance of RL agents by mitigating overfitting. However, data augmentation is a form of prior knowledge, and naively applying it in an environment might worsen an agent's performance. In this paper, we propose a novel RL algorithm to mitigate the above issue and improve the efficiency of the learned policy. Our approach consists of a max-min game-theoretic objective in which a perturber network modifies the state to maximize the agent's probability of taking a different action while minimizing the distortion in the state. In contrast, the policy network updates its parameters to minimize the effect of the perturbation while maximizing the expected future reward. Based on this objective, we propose a practical deep reinforcement learning algorithm, Adversarial Policy Optimization (APO). Our method is agnostic to the type of policy optimization, and thus data augmentation can be incorporated to harness its benefit. We evaluated our approach on several DeepMind Control robotic environments with high-dimensional and noisy state settings. Empirical results demonstrate that APO consistently outperforms the state-of-the-art on-policy PPO agent. We further compare our method with a state-of-the-art data augmentation method, RAD, and a regularization-based approach, DRAC. APO shows better performance than these baselines.
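
The max-min objective described in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not give the exact losses, so everything below is an assumption made for illustration: the shift in the action distribution is measured with a KL divergence, the state distortion with a squared L2 distance, and the names `perturber_objective`, `policy_loss`, `distortion_weight`, and `robustness_weight` are hypothetical. It is not the authors' implementation.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) per row; an assumed measure of how far the action distribution moved."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)


def perturber_objective(policy_logits_fn, states, perturbed_states,
                        distortion_weight=1.0):
    """Quantity the perturber would maximize (assumed form).

    Rewards a large change in the policy's action distribution and
    penalizes distortion of the state (squared L2 distance, an assumption).
    """
    p = softmax(policy_logits_fn(states))
    q = softmax(policy_logits_fn(perturbed_states))
    action_shift = kl_divergence(p, q)
    distortion = np.sum((perturbed_states - states) ** 2, axis=-1)
    return float(np.mean(action_shift - distortion_weight * distortion))


def policy_loss(policy_logits_fn, states, perturbed_states, rl_loss,
                robustness_weight=1.0):
    """Quantity the policy would minimize (assumed form).

    `rl_loss` stands in for any policy-optimization loss (e.g. a PPO
    surrogate); the added term penalizes sensitivity to the perturbation.
    """
    p = softmax(policy_logits_fn(states))
    q = softmax(policy_logits_fn(perturbed_states))
    return float(rl_loss + robustness_weight * np.mean(kl_divergence(p, q)))
```

In this sketch the two players alternate: the perturber takes a gradient step that increases `perturber_objective`, and the policy takes a step that decreases `policy_loss`. Because `rl_loss` is passed in as a plain number, the sketch mirrors the abstract's claim that the approach is agnostic to the underlying policy-optimization method.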

Boosting Nonparametric Policies.

PolicyBoost: Functional Policy Gradient with Ranking-based Reward Objective

An Off-Policy Trust Region Policy Optimization Method with Monotonic Improvement Guarantee for Deep Reinforcement Learning

Boosted Off-Policy Learning

Stochastic Cubic-Regularized Policy Gradient Method

Boosting Weak-to-Strong Agents in Multiagent Reinforcement Learning via Balanced PPO

Deterministic Policy Optimization by Combining Pathwise and Score Function Estimators for Discrete Action Spaces

Absolute Policy Optimization

Acceleration in Policy Optimization

Population-Guided Parallel Policy Search for Reinforcement Learning

Quantile-Based Policy Optimization for Reinforcement Learning

Adversarial Policy Optimization in Deep Reinforcement Learning

Generalizable Policy Improvement via Reinforcement Sampling (Student Abstract)

SPO: Sequential Monte Carlo Policy Optimisation

A nearly Blackwell-optimal policy gradient method

Model-free Policy Learning with Reward Gradients

Policy Optimization with Smooth Guidance Learned from State-Only Demonstrations

Improving Policy Optimization with Generalist-Specialist Learning

Greedy-Step Off-Policy Reinforcement Learning

Monte Carlo Tree Search for Policy Optimization.

Behind the Myth of Exploration in Policy Gradients