Abstract: A policy represented by a deep neural network can overfit to spurious features in observations, which hampers a reinforcement learning (RL) agent's ability to learn an effective policy. This issue becomes severe in high-dimensional state spaces, where the agent struggles to learn a useful policy. Data augmentation can boost the performance of RL agents by mitigating overfitting. However, data augmentation is a form of prior knowledge, and applying it naively can worsen an agent's performance. In this paper, we propose a novel RL algorithm that mitigates this issue and improves the efficiency of the learned policy. Our approach consists of a max-min game-theoretic objective in which a perturber network modifies the state to maximize the agent's probability of taking a different action while minimizing the distortion of the state. In contrast, the policy network updates its parameters to minimize the effect of the perturbation while maximizing the expected future reward. Based on this objective, we propose a practical deep reinforcement learning algorithm, Adversarial Policy Optimization (APO). Our method is agnostic to the type of policy optimization, so data augmentation can also be incorporated to harness its benefits. We evaluate our approach on several DeepMind Control robotic environments with high-dimensional and noisy state settings. Empirical results demonstrate that APO consistently outperforms the state-of-the-art on-policy PPO agent. We further compare our method with the state-of-the-art data augmentation method RAD and the regularization-based approach DRAC; APO performs better than both baselines.
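The two sides of the max-min objective described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not give the exact loss, so the KL divergence between action distributions (as a proxy for "taking a different action"), the mean squared error (as the distortion measure), the discrete softmax policy, and the names `perturber_loss`, `policy_loss`, `distortion_weight`, and `consistency_weight` are all assumptions made for illustration.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of action logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) between two discrete action distributions.
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps))
               for pi, qi in zip(p, q))

def mean_squared_distortion(state, perturbed_state):
    # Assumed distortion measure: mean squared difference between states.
    return sum((a - b) ** 2 for a, b in zip(state, perturbed_state)) / len(state)

def perturber_loss(policy_logits, state, perturbed_state, distortion_weight=1.0):
    # Loss the perturber would minimize: a larger action shift lowers it,
    # a larger state distortion raises it (hypothetical formulation).
    p = softmax(policy_logits(state))
    q = softmax(policy_logits(perturbed_state))
    shift = kl_divergence(p, q)
    distortion = mean_squared_distortion(state, perturbed_state)
    return -shift + distortion_weight * distortion

def policy_loss(rl_loss, policy_logits, state, perturbed_state, consistency_weight=1.0):
    # Loss the policy would minimize: the usual RL loss (e.g. a PPO loss,
    # passed in as a number here) plus a penalty for changing its action
    # distribution under the perturbation (hypothetical formulation).
    p = softmax(policy_logits(state))
    q = softmax(policy_logits(perturbed_state))
    return rl_loss + consistency_weight * kl_divergence(p, q)
```

In this sketch the two networks would be updated in alternation: the perturber on `perturber_loss`, the policy on `policy_loss`. Because `rl_loss` is passed in as an opaque value, the consistency penalty can sit on top of any policy optimization loss, which mirrors the abstract's claim that the method is agnostic to the type of policy optimization.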

Maximum a Posteriori Policy Optimisation

Proximal Policy Optimization Algorithms

Proximal Policy Optimization with Future Rewards

Adversarial Policy Optimization in Deep Reinforcement Learning

A Theoretical Analysis of Optimistic Proximal Policy Optimization in Linear Markov Decision Processes

Beyond the Boundaries of Proximal Policy Optimization

Absolute Policy Optimization

Maximum Entropy On-Policy Actor-Critic via Entropy Advantage Estimation

Discovered Policy Optimisation

Adversarial Constrained Policy Optimization: Improving Constrained Reinforcement Learning by Adapting Budgets

Provably Efficient Exploration in Policy Optimization

Anti-Martingale Proximal Policy Optimization

Proximal policy optimization with model-based methods

Multi-Path Policy Optimization

Policy Optimization with Model-based Explorations

Off-OAB: Off-Policy Policy Gradient Method with Optimal Action-Dependent Baseline

Proximal Policy Optimization Smoothed Algorithm

Fast-PPO: Proximal Policy Optimization with Optimal Baseline Method

Augmented Proximal Policy Optimization for Safe Reinforcement Learning

Ε-Maximum Critic Deep Deterministic Policy Gradient for Multi-agent Reinforcement Learning