Abstract: We propose the first black-box targeted attack against online deep reinforcement learning through reward poisoning during training time. Our attack is applicable to general environments with unknown dynamics learned by unknown algorithms and requires limited attack budgets and computational resources. We leverage a general framework and find conditions to ensure efficient attack under a general assumption of the learning algorithms. We show that our attack is optimal in our framework under the conditions. We experimentally verify that with limited budgets, our attack efficiently leads the learning agent to various target policies under a diverse set of popular DRL environments and state-of-the-art learners.

What problem does this paper attempt to address?

The problem this paper addresses is how to mount a targeted reward-poisoning attack on deep reinforcement learning (DRL) in a black-box setting, so that an attacker can steer the learning agent toward a specific target policy by tampering with the reward signal during training. The attack applies to general environments with unknown dynamics and unknown learning algorithms, and requires only limited attack budgets and computational resources.

### Problem Background

Online deep reinforcement learning (DRL) algorithms have great potential in industrial applications such as robot control and recommendation systems. In these applications, however, the reward signals used during training usually depend on human feedback, which exposes training to reward-poisoning attacks. By supplying malicious rewards, an attacker can manipulate a DRL agent into learning a specific policy with undesirable properties and benefit from it. Such a policy may score well on the agent's performance metrics while actually behaving unsafely, which makes the attack difficult to detect.

### Limitations of Existing Research

Previous research has mainly focused on attacks in simpler tabular settings [Rakhsha et al., 2020, Xu et al., 2021, Zhang et al., 2020, Banihashem et al., 2022]. These studies revealed the vulnerability of current tabular RL algorithms, but the attacks are effective only in white-box settings, which may be unrealistic in practice. In addition, these methods usually require comprehensive knowledge of the environment and the learning agent, which becomes infeasible in continuous DRL environments.

### Main Contributions of the Paper

- **First black-box targeted attack**: The paper proposes the first black-box targeted reward-poisoning attack for online DRL.
- **Theoretical analysis and experimental verification**: The paper provides theoretical analysis and experimental results showing that the attack can successfully mislead the agent into learning the target policy with limited budgets and computational resources.
- **Wide applicability**: The attack applies to multiple environments and state-of-the-art DRL algorithms. It has been verified experimentally in various popular DRL environments, including MountainCar and HalfCheetah, and against algorithms such as TD3 and Double Dueling DQN.

### Attack Framework and Method

To achieve this goal, the authors design an attack framework in which the agent is trained in a static adversarial environment obtained by perturbing the real environment. Under an assumption that the learning algorithm is efficient, the authors derive conditions that ensure an efficient attack and develop an optimal attack within this framework that can be constructed with a minimal budget and very limited computational resources in a black-box setting.

### Experimental Results

The experimental results show that, even with a limited budget, the attack successfully drives the agent to take actions close to the target actions during training. The experiments cover different types of environments and learning algorithms and consider multiple target policies, supporting the effectiveness and wide applicability of the attack.

### Conclusion

The paper addresses targeted reward-poisoning attacks on online deep reinforcement learning in a black-box setting, demonstrates the vulnerability of existing DRL algorithms to such attacks, and provides a reference point for future defenses.
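The idea of training the agent in a perturbed, adversarial version of the environment can be illustrated with a small sketch. This is not the paper's construction: it assumes a simple rule in which the attacker subtracts a fixed penalty from the reward whenever the agent's action is far from the target policy's action, and stops once a total budget is spent. The class name `RewardPoisoner` and the parameters `penalty`, `tolerance`, and `budget` are hypothetical names introduced only for this illustration. The attacker sees only the observed state, action, and reward, which matches the black-box setting described above.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class RewardPoisoner:
    """Illustrative black-box reward poisoner (hypothetical, not the paper's method)."""

    # Maps an observed state to the action the attacker wants the agent to take.
    target_policy: Callable[[Sequence[float]], Sequence[float]]
    penalty: float    # amount subtracted from the reward on an off-target step
    tolerance: float  # actions within this distance of the target are left alone
    budget: float     # total perturbation the attacker may spend
    spent: float = 0.0

    def poison(self, state: Sequence[float], action: Sequence[float], reward: float) -> float:
        """Return the reward that the learner will observe for this step."""
        target_action = self.target_policy(state)
        distance = math.dist(action, target_action)
        on_target = distance <= self.tolerance
        out_of_budget = self.spent + self.penalty > self.budget
        if on_target or out_of_budget:
            return reward  # pass the true reward through unchanged
        self.spent += self.penalty
        return reward - self.penalty


# Assumed toy target policy: always prefer the zero action.
attacker = RewardPoisoner(
    target_policy=lambda state: [0.0],
    penalty=1.0,
    tolerance=0.1,
    budget=2.0,
)
on_target_reward = attacker.poison([0.5], [0.05], 1.0)   # close to target
off_target_reward = attacker.poison([0.5], [1.0], 1.0)   # far from target
```

In a training loop, `poison` would sit between the environment and the learner, so the learner only ever sees the returned value. The budget check is one simple way to model a limited attack budget; the paper's own budget definition may differ.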

Black-Box Targeted Reward Poisoning Attack Against Online Deep Reinforcement Learning

Efficient Reward Poisoning Attacks on Online Deep Reinforcement Learning

Universal Black-Box Reward Poisoning Attack against Offline Reinforcement Learning

Reward Poisoning Attack Against Offline Reinforcement Learning

Online Poisoning Attack Against Reinforcement Learning under Black-box Environments

MARNet: Backdoor Attacks Against Cooperative Multi-Agent Reinforcement Learning

Adversarial Inception for Bounded Backdoor Poisoning in Deep Reinforcement Learning

SleeperNets: Universal Backdoor Poisoning Attacks Against Reinforcement Learning Agents

Vulnerability-Aware Poisoning Mechanism for Online RL with Unknown Dynamics

BadRL: Sparse Targeted Backdoor Attack Against Reinforcement Learning

Offline Reward Perturbation Boosts Distributional Shift in Online RL

Reward Delay Attacks on Deep Reinforcement Learning

Optimal Attack and Defense for Reinforcement Learning

Behavior-Targeted Attack on Reinforcement Learning with Limited Access to Victim's Policy

Strategically-timed State-Observation Attacks on Deep Reinforcement Learning Agents

Deep-Attack over the Deep Reinforcement Learning

Reinforcement Learning For Data Poisoning on Graph Neural Networks

PARL: Poisoning Attacks Against Reinforcement Learning-based Recommender Systems

Is poisoning a real threat to LLM alignment? Maybe more so than you think

Understanding Adversarial Attacks on Observations in Deep Reinforcement Learning

Local Environment Poisoning Attacks on Federated Reinforcement Learning