Abstract: Adversarial examples pose a threat to deep neural network models in a variety of scenarios, ranging from the "white box" setting, where the adversary has complete knowledge of the model, to the opposite "black box" setting. In this paper, we explore the use of output randomization as a defense against attacks in both the black box and white box settings and propose two defenses. In the first defense, we propose output randomization at test time to thwart finite difference attacks in black box settings. Since this type of attack relies on repeated queries to the model to estimate gradients, we investigate the use of randomization to prevent such adversaries from successfully creating adversarial examples. We empirically show that this defense can limit the success rate of a black box adversary using the Zeroth Order Optimization (ZOO) attack to 0%. In the second defense, we propose output randomization training as a defense against white box adversaries. Unlike prior approaches that use randomization, our defense does not require randomization at test time, which rules out the Backward Pass Differentiable Approximation (BPDA) attack that was shown to be effective against other randomization defenses. Additionally, this defense has low overhead and is easily implemented, allowing it to be used together with other defenses across various model architectures. We evaluate output randomization training against the Projected Gradient Descent (PGD) attacker and show that the defense can reduce the PGD attack's success rate to 12% when using cross-entropy loss.

What problem does this paper attempt to address?

The paper addresses the security of deep neural network models against adversarial example attacks. Specifically, it focuses on defending against two types of adversarial attacks: white-box attacks and black-box attacks. A white-box attack assumes that the attacker has complete knowledge of the model's structure and parameters, while a black-box attack assumes that the attacker can only obtain output information by querying the model.

### Main Contributions

1. **Output Randomization as a Defense Strategy**:
   - The paper proposes randomizing the model output at test time to defend against black-box attacks based on finite-difference estimation.
   - It also proposes a defense that introduces output randomization during training to resist white-box attacks.
2. **Defense Against Black-Box Attacks**:
   - Adding noise to the model output at test time makes it difficult for black-box attacks based on finite-difference estimation to succeed. Experimental results show that this method can reduce the success rate of Zeroth Order Optimization (ZOO) attacks to 0%.
3. **Defense Against White-Box Attacks**:
   - Because output randomization is introduced only during training, no randomization is needed at test time, which avoids Backward Pass Differentiable Approximation (BPDA) attacks. Experiments show that this method can significantly improve the model's robustness to Projected Gradient Descent (PGD) attacks, especially when using the cross-entropy loss function.
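The black-box defense can be illustrated with a minimal sketch. It runs a symmetric finite-difference gradient estimate against a toy linear-softmax model, once with clean outputs and once with Gaussian noise added to the returned probabilities. The model, the weight matrix `W`, the step size `h`, the noise scale `sigma`, and the choice to add noise to probabilities are all assumptions made for illustration; they are not taken from the paper's experiments.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def make_model(W, sigma=0.0, rng=None):
    """Return f(x) -> class probabilities, optionally with Gaussian output noise."""
    def f(x):
        p = softmax(W @ x)
        if sigma > 0.0:
            # Hypothetical test-time output randomization: perturb the returned scores.
            p = p + rng.normal(0.0, sigma, size=p.shape)
        return p
    return f

def loss(p, y):
    # Negative log-score of the true class; clip because noisy scores can be <= 0.
    return -np.log(np.clip(p[y], 1e-12, None))

def fd_gradient(f, x, y, h=1e-4):
    """Symmetric finite-difference estimate, two queries per input coordinate."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e_i = np.zeros_like(x)
        e_i[i] = 1.0
        g[i] = (loss(f(x + h * e_i), y) - loss(f(x - h * e_i), y)) / (2.0 * h)
    return g

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))    # toy 3-class model on 5 input features
x = rng.normal(size=5)
y = 0

g_clean = fd_gradient(make_model(W), x, y)
g_noisy = fd_gradient(make_model(W, sigma=0.01, rng=rng), x, y)

# Analytic gradient of the clean loss, for reference: W^T (softmax(Wx) - onehot(y)).
g_true = W.T @ (softmax(W @ x) - np.eye(3)[y])

err_clean = np.linalg.norm(g_clean - g_true)
err_noisy = np.linalg.norm(g_noisy - g_true)
print("error without output noise:", err_clean)
print("error with output noise:   ", err_noisy)
```

In this sketch the noise in the two queries does not cancel, and dividing by the small step `2h` amplifies it, so even a small `sigma` swamps the true gradient signal in the attacker's estimate.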
### Mathematical Formulas and Explanations

- **Finite-Difference Gradient Estimate**:
  \[ g_i = \frac{L(f(x + h e_i)) - L(f(x - h e_i))}{2h} \]
  where \( g_i \) is the finite-difference gradient estimate for the \( i \)-th pixel, \( L \) is the loss function, \( f \) is the model, \( x \) is the input, \( h \) is a small constant, and \( e_i \) is the \( i \)-th unit vector.
- **Expected Gradient Error After Output Randomization**:
  \[ \left| \mathbb{E}[g_i - \gamma_i] \right| = \left| g_i - \mathbb{E}\left[ \frac{L(p + \epsilon) - L(p' + \epsilon')}{2h} \right] \right| \]
  where \( \gamma_i \) is the gradient estimate computed by the attacker from the randomized outputs, \( p \) and \( p' \) denote the model outputs at the two perturbed inputs \( x + h e_i \) and \( x - h e_i \), and \( \epsilon \) and \( \epsilon' \) are the noise terms added to those outputs.
- **ERM Problem with Noise**:
  \[ \min_{\theta} \mathbb{E}_{(x,y)\sim P} \, \mathbb{E}_{\epsilon}\left[ L(f_{\theta}(x) + \epsilon, y) \right] \]
  where \( \theta \) denotes the model parameters, \( L \) is the loss function, and \( \epsilon \sim \mathcal{N}(0, \Sigma) \) is Gaussian noise.

### Conclusion

The paper shows that output randomization is a simple and effective defense strategy that can significantly improve a model's robustness to adversarial attacks without degrading the model's performance. For black-box attacks, test-time output randomization almost completely prevents the attacks from succeeding; for white-box attacks, output randomization training significantly improves the model's robustness.
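The noisy ERM objective above can be sketched as a Monte Carlo estimate of the inner expectation over \( \epsilon \). This sketch rests on several assumptions that the summary does not confirm: the noise is added to the logits, the covariance is isotropic (\( \Sigma = \sigma^2 I \)), the loss is softmax cross-entropy, and the helper names `cross_entropy` and `randomized_loss` are hypothetical.

```python
import numpy as np

def cross_entropy(logits, y):
    """Mean softmax cross-entropy for a batch of logits and integer labels y."""
    z = logits - logits.max(axis=-1, keepdims=True)            # stabilize exp
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -np.take_along_axis(log_probs, y[:, None], axis=-1).mean()

def randomized_loss(logits, y, sigma, rng, n_samples=1):
    """Monte Carlo estimate of E_eps[L(f(x) + eps, y)] with eps ~ N(0, sigma^2 I).

    Assumption: noise is applied to the logits; the paper may place it elsewhere.
    """
    losses = []
    for _ in range(n_samples):
        eps = rng.normal(0.0, sigma, size=logits.shape)
        losses.append(cross_entropy(logits + eps, y))
    return float(np.mean(losses))

rng = np.random.default_rng(0)
logits = np.array([[2.0, 0.5, -1.0],
                   [0.0, 1.0, 0.0]])   # stand-in for f_theta(x) on two examples
y = np.array([0, 1])

clean = float(cross_entropy(logits, y))
no_noise = randomized_loss(logits, y, sigma=0.0, rng=rng)
noisy = randomized_loss(logits, y, sigma=1.0, rng=rng, n_samples=2000)
print("standard cross-entropy:", clean)
print("randomized objective:  ", noisy)
```

With `sigma=0.0` the objective reduces to ordinary cross-entropy. In a training loop, one noise draw per mini-batch (`n_samples=1`) would presumably serve as a stochastic estimate of the same objective, and the noise would be dropped entirely at test time, in line with the summary's description.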

Output Randomization: A Novel Defense for both White-box and Black-box Adversarial Models
