Abstract: A policy represented by a deep neural network can overfit to spurious features in observations, which hinders a reinforcement learning (RL) agent from learning an effective policy. This issue becomes severe in high-dimensional state spaces, where the agent struggles to learn a useful policy. Data augmentation can boost the performance of RL agents by mitigating overfitting. However, data augmentation is a form of prior knowledge, and naively applying it in an environment might worsen an agent's performance. In this paper, we propose a novel RL algorithm to mitigate the above issue and improve the efficiency of the learned policy. Our approach consists of a max-min game-theoretic objective in which a perturber network modifies the state to maximize the agent's probability of taking a different action while minimizing the distortion in the state. In contrast, the policy network updates its parameters to minimize the effect of the perturbation while maximizing the expected future reward. Based on this objective, we propose a practical deep reinforcement learning algorithm, Adversarial Policy Optimization (APO). Our method is agnostic to the type of policy optimization, and thus data augmentation can be incorporated to harness its benefit. We evaluate our approach on several DeepMind Control robotic environments with high-dimensional and noisy state settings. Empirical results demonstrate that APO consistently outperforms the state-of-the-art on-policy PPO agent. We further compare our method with the state-of-the-art data augmentation method RAD and the regularization-based approach DRAC. APO shows better performance than these baselines.

What problem does this paper attempt to address?

This paper addresses the problem that, in high-dimensional state spaces, reinforcement learning (RL) agents are prone to overfitting to spurious features in observations, which hinders them from learning effective policies. Noise in the state makes policy learning harder still. Data augmentation can improve performance, but applying it inappropriately may make an agent worse.

To alleviate these problems and improve the effectiveness of the learned policy, the paper proposes a new RL algorithm, Adversarial Policy Optimization (APO). APO introduces an adversarial network (the perturber network) that modifies the state, with the goal of maximizing the probability that the agent takes a different action while minimizing the distortion of the state. Meanwhile, the policy network updates its parameters to minimize the influence of the perturber while maximizing the expected future reward. Training the policy adversarially in this way makes it more robust to high-dimensional and noisy states.

The main contributions of the paper include:
- Proposing the deep reinforcement learning algorithm APO for high-dimensional and noisy states.
- Evaluating the method in 10 DeepMind Control environment settings, which include high-dimensional and noisy states.
- Showing experimentally that APO outperforms PPO in all settings and generally outperforms the data-augmentation-based methods RAD and DRAC.

Through these contributions, APO not only improves learning efficiency in high-dimensional and noisy states but also demonstrates robustness and effectiveness in complex environments.
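The two opposing objectives described above can be illustrated with a small sketch. It is not the paper's implementation: the linear-softmax policy, the use of KL divergence to measure how much the action distribution shifts, the squared-L2 distortion penalty, and the weights `lam` and `beta` are all assumptions chosen for illustration, and `rl_loss` stands in for whatever policy-optimization loss (for example PPO's) is being used.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p, q, eps=1e-8):
    # KL(p || q) per row; eps guards against log(0).
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def policy_probs(W, states):
    # Hypothetical linear-softmax policy (an assumption, not the paper's network).
    return softmax(states @ W)

def perturber_loss(W, states, perturbed, lam=1.0):
    # Perturber: make the action distribution change as much as possible
    # (negative KL term) while keeping the perturbed state close to the
    # original one (distortion penalty weighted by the assumed `lam`).
    shift = kl_divergence(policy_probs(W, states), policy_probs(W, perturbed)).mean()
    distortion = np.mean(np.sum((perturbed - states) ** 2, axis=-1))
    return -shift + lam * distortion

def policy_loss(W, states, perturbed, rl_loss, beta=1.0):
    # Policy: minimize the usual RL loss (i.e. maximize expected reward)
    # plus a penalty on how much the perturbation changes its action
    # distribution (weighted by the assumed `beta`).
    shift = kl_divergence(policy_probs(W, states), policy_probs(W, perturbed)).mean()
    return rl_loss + beta * shift

# Toy usage with random data: 4 states of dimension 6, 3 discrete actions.
rng = np.random.default_rng(0)
states = rng.normal(size=(4, 6))
W = rng.normal(size=(6, 3))
perturbed = states + 0.1 * rng.normal(size=states.shape)
print(perturber_loss(W, states, perturbed), policy_loss(W, states, perturbed, rl_loss=0.5))
```

In a training loop, the perturber's parameters would be updated to lower `perturber_loss` and the policy's parameters to lower `policy_loss`, alternating between the two; how the paper schedules and parameterizes these updates is not stated in the summary above.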

Adversarial Policy Optimization in Deep Reinforcement Learning

Adversarial Constrained Policy Optimization: Improving Constrained Reinforcement Learning by Adapting Budgets

Augmented Proximal Policy Optimization for Safe Reinforcement Learning

Absolute Policy Optimization

Overcoming Reward Overoptimization via Adversarial Policy Optimization with Lightweight Uncertainty Estimation

Adversarial Policies: Attacking Deep Reinforcement Learning

Orthogonal Adversarial Deep Reinforcement Learning for Discrete- and Continuous-Action Problems

Online Robust Policy Learning in the Presence of Unknown Adversaries

A2PO: Towards Effective Offline Reinforcement Learning from an Advantage-aware Perspective

Adaptation Augmented Model-based Policy Optimization

Probabilistic Perspectives on Error Minimization in Adversarial Reinforcement Learning

Attacking Deep Reinforcement Learning with Decoupled Adversarial Policy

Monotonic Robust Policy Optimization with Model Discrepancy

Adaptive Advantage-Guided Policy Regularization for Offline Reinforcement Learning

Online Policy Optimization for Robust MDP

Attacking and Defending Deep Reinforcement Learning Policies

Robust Deep Reinforcement Learning with Adaptive Adversarial Perturbations in Action Space

Discovered Policy Optimisation

Adversary Agnostic Robust Deep Reinforcement Learning

Model-based Multi-agent Policy Optimization with Adaptive Opponent-wise Rollouts

Accelerating Proximal Policy Optimization Learning Using Task Prediction for Solving Environments with Delayed Rewards