Abstract: We propose the first black-box targeted attack against online deep reinforcement learning through reward poisoning during training time. Our attack is applicable to general environments with unknown dynamics learned by unknown algorithms and requires limited attack budgets and computational resources. We leverage a general framework and find conditions to ensure efficient attack under a general assumption of the learning algorithms. We show that our attack is optimal in our framework under the conditions. We experimentally verify that with limited budgets, our attack efficiently leads the learning agent to various target policies under a diverse set of popular DRL environments and state-of-the-art learners.
What problem does this paper attempt to address?
The paper asks how to carry out a targeted reward-poisoning attack on deep reinforcement learning (DRL) in a black-box setting, so that an attacker who tampers with the reward signal during training can steer the learning agent toward a specific target policy. The attack applies to general environments whose dynamics are unknown and to learners whose algorithm is unknown, and it requires only a limited attack budget and limited computational resources.
### Problem Background
Online deep reinforcement learning (DRL) algorithms have great potential in industrial applications such as robot control and recommendation systems. In these settings, however, the reward signal during training usually depends on human feedback, which exposes the learner to reward-poisoning attacks at training time. By supplying malicious rewards, an attacker can manipulate a DRL agent into learning a specific policy with undesirable properties and profit from it. Such a policy may score well on the agent's performance metrics while actually behaving unsafely, which makes the attack difficult to detect.
### Limitations of Existing Research
Previous research has mainly focused on attacks in simpler tabular settings [Rakhsha et al., 2020, Xu et al., 2021, Zhang et al., 2020, Banihashem et al., 2022]. These studies revealed the vulnerability of current tabular RL algorithms, but the attacks are effective only in white-box settings, which are often unrealistic in practice. In addition, these methods usually require comprehensive knowledge of the environment and the learning agent, which becomes infeasible in continuous DRL environments.
### Main Contributions of the Paper
- **First black-box targeted attack**: The paper proposes the first black-box targeted reward-poisoning attack against online DRL.
- **Theoretical analysis and experimental verification**: The paper gives a theoretical analysis showing that the attack is optimal within its framework under the stated conditions, and experimental results showing that the attack leads the agent to learn the target policy with a limited budget and limited computational resources.
- **Wide applicability**: The attack applies to multiple environments and state-of-the-art DRL algorithms. It has been experimentally verified in various popular DRL environments, including MountainCar and HalfCheetah, and against algorithms such as TD3 and Double Dueling DQN.
### Attack Framework and Method
To achieve this goal, the authors design an attack framework in which the agent is trained in a static adversarial environment obtained by perturbing the real environment. Under an assumption that the learning algorithm is efficient, the authors identify conditions that ensure an efficient attack and develop an attack that is optimal within the framework: it can be constructed with a minimal budget and very limited computational resources in a black-box setting.
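The idea of training the agent in a perturbed environment can be illustrated with a minimal sketch. This is not the authors' construction, which this summary does not spell out; it is one plausible instance in which the reward is lowered by a fixed amount whenever the agent's action is far from the target action. The names `RewardPoisoner`, `penalty`, and `tolerance` are hypothetical, and the attacker is assumed to observe only the state, action, and reward of each step, with no access to the learner's internals.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class RewardPoisoner:
    """Illustrative black-box reward perturbation (hypothetical, not the paper's method)."""

    # Maps a state to the action the attacker wants the agent to take.
    target_policy: Callable[[Sequence[float]], Sequence[float]]
    penalty: float = 1.0      # assumed size of each reward perturbation
    tolerance: float = 0.1    # assumed distance below which an action counts as "on target"
    total_cost: float = 0.0   # accumulated magnitude of all perturbations
    steps_poisoned: int = 0   # number of steps whose reward was changed

    def poison(self, state: Sequence[float], action: Sequence[float], reward: float) -> float:
        """Return the reward the agent will see for this step."""
        target_action = self.target_policy(state)
        distance = math.dist(action, target_action)
        if distance <= self.tolerance:
            # On-target actions keep their true reward, so no budget is spent.
            return reward
        # Off-target actions are penalized; track the cost for budget accounting.
        self.total_cost += self.penalty
        self.steps_poisoned += 1
        return reward - self.penalty


# Usage: the target policy here always prefers the action [0.0].
poisoner = RewardPoisoner(target_policy=lambda state: [0.0], penalty=2.0)
clean = poisoner.poison(state=[1.0], action=[0.05], reward=1.0)     # within tolerance
poisoned = poisoner.poison(state=[1.0], action=[1.0], reward=1.0)   # off target
```

Because the perturbation depends only on the current state and action, the resulting environment is static from the agent's point of view, which matches the framing above. The counters make it possible to compare the attack's cost against whatever budget the attacker is allowed.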
### Experimental Results
The experimental results show that even with a limited budget, this attack can successfully make the agent take actions close to the target actions during the training process. The experiments cover different types of environments and learning algorithms and consider multiple target policies, verifying the effectiveness and wide applicability of the attack.
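One way to quantify "actions close to the target actions" is the fraction of training steps on which the agent's action falls within some distance of the target action. The sketch below is an assumption for illustration; this summary does not say which metric the paper reports, and the function name `target_match_rate` and the `tolerance` threshold are hypothetical.

```python
import math
from typing import Callable, Iterable, Sequence, Tuple

State = Sequence[float]
Action = Sequence[float]


def target_match_rate(
    trajectory: Iterable[Tuple[State, Action]],
    target_policy: Callable[[State], Action],
    tolerance: float = 0.1,  # assumed threshold for "close to the target action"
) -> float:
    """Fraction of (state, action) pairs whose action is within `tolerance` of the target."""
    steps = list(trajectory)
    if not steps:
        return 0.0  # no data: report zero rather than divide by zero
    hits = sum(
        1
        for state, action in steps
        if math.dist(action, target_policy(state)) <= tolerance
    )
    return hits / len(steps)


# Usage: two of the four recorded actions lie within 0.1 of the target action [0.0].
recorded = [([0.0], [0.0]), ([0.0], [1.0]), ([0.0], [0.05]), ([0.0], [0.5])]
rate = target_match_rate(recorded, target_policy=lambda state: [0.0])
```

Tracking this rate over the course of training would show whether the agent drifts toward the target policy as the poisoned rewards accumulate.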
### Conclusion
This paper addresses targeted reward-poisoning attacks on online deep reinforcement learning in a black-box setting, demonstrates the vulnerability of existing DRL algorithms to such attacks, and provides a reference point for future defenses.