Adversarial Policy Optimization in Deep Reinforcement Learning

Md Masudur Rahman, Yexiang Xue
DOI: https://doi.org/10.48550/arXiv.2304.14533
2023-04-28
Abstract: A policy represented by a deep neural network can overfit to spurious features in observations, which hampers a reinforcement learning agent from learning an effective policy. This issue becomes severe in high-dimensional state spaces, where the agent struggles to learn a useful policy. Data augmentation can provide a performance boost to RL agents by mitigating the effect of overfitting. However, such data augmentation is a form of prior knowledge, and naively applying it in environments might worsen an agent's performance. In this paper, we propose a novel RL algorithm to mitigate the above issue and improve the efficiency of the learned policy. Our approach consists of a max-min game-theoretic objective in which a perturber network modifies the state to maximize the agent's probability of taking a different action while minimizing the distortion in the state. In contrast, the policy network updates its parameters to minimize the effect of the perturbation while maximizing the expected future reward. Based on this objective, we propose a practical deep reinforcement learning algorithm, Adversarial Policy Optimization (APO). Our method is agnostic to the type of policy optimization, and thus data augmentation can be incorporated to harness its benefit. We evaluate our approach on several DeepMind Control robotic environments with high-dimensional and noisy state settings. Empirical results demonstrate that our method, APO, consistently outperforms the state-of-the-art on-policy PPO agent. We further compare our method with the state-of-the-art data augmentation method RAD and the regularization-based approach DRAC. Our agent APO shows better performance compared to these baselines.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper attempts to solve the problem that in high-dimensional state spaces, reinforcement learning (RL) agents are prone to overfitting to spurious features in observations, which hinders agents from learning effective policies. In addition, in high-dimensional states, it is difficult for agents to learn useful policies, and the presence of noise also makes policy learning more difficult. Although data augmentation can provide performance improvement, inappropriate application may worsen the performance of agents. Therefore, the paper proposes a new RL algorithm, Adversarial Policy Optimization (APO), to alleviate the above problems and improve the effectiveness of the learned policies.

Specifically, APO modifies the state by introducing an adversarial network (perturber network), with the goal of maximizing the probability that the agent takes different actions while minimizing the distortion of the state. Meanwhile, the policy network updates its parameters to minimize the influence of the adversarial network while maximizing the expected future rewards. This method trains the policy in an adversarial manner, making the policy more robust to high-dimensional and noisy states.

The main contributions of the paper include:
- Proposing the deep reinforcement learning algorithm APO for high-dimensional and noisy states.
- Evaluating the effectiveness of the method in 10 DeepMind Control environment settings, which include high-dimensional and noisy states.
- Experimental results show that APO outperforms PPO in all settings and generally outperforms data-augmentation-based methods RAD and DRAC.

Through these contributions, APO not only improves the learning efficiency in high-dimensional and noisy states but also demonstrates robustness and effectiveness in complex environments.
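
To make the two opposing objectives described above more concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation, and the summary does not give the paper's exact loss. Everything in it is an assumption made for illustration: the linear softmax policy over discrete actions (the DeepMind Control tasks use continuous actions), the use of KL divergence as the measure of "taking a different action", the squared L2 distance as the distortion measure, and the names `perturber_loss`, `policy_loss`, `distortion_coef`, and `robustness_coef`.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) per row, for discrete distributions."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)


def policy_probs(weights, states):
    """Hypothetical linear softmax policy: states @ weights -> action probs."""
    return softmax(states @ weights)


def action_shift(weights, states, perturbed):
    """Mean KL between action distributions on original and perturbed states
    (an assumed proxy for 'probability of taking a different action')."""
    p = policy_probs(weights, states)
    q = policy_probs(weights, perturbed)
    return kl_divergence(p, q).mean()


def perturber_loss(weights, states, perturbed, distortion_coef=1.0):
    """Perturber side: maximize the action shift while keeping the
    perturbed state close to the original (written as a loss to minimize)."""
    distortion = np.mean(np.sum((perturbed - states) ** 2, axis=-1))
    return -action_shift(weights, states, perturbed) + distortion_coef * distortion


def policy_loss(weights, states, perturbed, rl_loss, robustness_coef=1.0):
    """Policy side: minimize the usual RL loss (e.g. a PPO surrogate,
    passed in as a scalar here) plus the effect of the perturbation."""
    return rl_loss + robustness_coef * action_shift(weights, states, perturbed)


if __name__ == "__main__":
    weights = np.eye(2)                # toy policy parameters
    states = np.array([[1.0, 0.0]])    # toy observation
    perturbed = np.array([[0.0, 1.0]]) # toy output of a perturber
    print("perturber loss:", perturber_loss(weights, states, perturbed))
    print("policy loss:", policy_loss(weights, states, perturbed, rl_loss=0.5))
```

In a full training loop the two losses would presumably be minimized in alternation by gradient steps on the perturber and policy parameters respectively. The sketch only evaluates the losses, so that the opposing signs on the action-shift term are easy to see.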