Output Randomization: A Novel Defense for both White-box and Black-box Adversarial Models

Daniel Park, Haidar Khan, Azer Khan, Alex Gittens, Bülent Yener
DOI: https://doi.org/10.48550/arXiv.2107.03806
2021-07-08
Abstract: Adversarial examples pose a threat to deep neural network models in a variety of scenarios, from settings where the adversary has complete knowledge of the model in a "white box" setting and to the opposite in a "black box" setting. In this paper, we explore the use of output randomization as a defense against attacks in both the black box and white box models and propose two defenses. In the first defense, we propose output randomization at test time to thwart finite difference attacks in black box settings. Since this type of attack relies on repeated queries to the model to estimate gradients, we investigate the use of randomization to thwart such adversaries from successfully creating adversarial examples. We empirically show that this defense can limit the success rate of a black box adversary using the Zeroth Order Optimization attack to 0%. Secondly, we propose output randomization training as a defense against white box adversaries. Unlike prior approaches that use randomization, our defense does not require its use at test time, eliminating the Backward Pass Differentiable Approximation attack, which was shown to be effective against other randomization defenses. Additionally, this defense has low overhead and is easily implemented, allowing it to be used together with other defenses across various model architectures. We evaluate output randomization training against the Projected Gradient Descent attacker and show that the defense can reduce the PGD attack's success rate down to 12% when using cross-entropy loss.
Machine Learning, Cryptography and Security
What problem does this paper attempt to address?
This paper addresses the security of deep neural network models against adversarial-example attacks. Specifically, the authors focus on defending against two types of adversarial attack: white-box and black-box. A white-box attack assumes that the attacker has complete knowledge of the model's structure and parameters, while a black-box attack assumes that the attacker can only observe outputs by querying the model.

### Main Contributions

1. **Output Randomization as a Defense Strategy**:
   - The authors propose randomizing the model output at test time to defend against black-box attacks that rely on finite-difference gradient estimation.
   - They also propose introducing output randomization during training to resist white-box attacks.
2. **Defense Against Black-Box Attacks**:
   - Adding noise to the model output at test time corrupts the gradient estimates that finite-difference attacks build from repeated queries. Experimental results show that this method can reduce the success rate of the Zeroth Order Optimization (ZOO) attack to 0%.
3. **Defense Against White-Box Attacks**:
   - Output randomization during training requires no randomization at test time, so the Backward Pass Differentiable Approximation (BPDA) attack, which was shown to be effective against other randomization defenses, does not apply. Experiments show that this method can reduce the success rate of the Projected Gradient Descent (PGD) attack to 12% when cross-entropy loss is used.
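The effect of test-time output randomization on a finite-difference attacker can be illustrated with a small sketch. This is not the paper's implementation: the linear softmax model `W`, the noise level `sigma`, the step size `h`, and the toy `attacker_loss` (the reported probability of the true class) are all assumptions chosen to keep the example short.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


def make_query(W, sigma=0.0, rng=None):
    """Build a black-box query function x -> class probabilities.

    If sigma > 0, fresh Gaussian noise is added to every returned output,
    mimicking output randomization at test time.
    """
    rng = rng if rng is not None else np.random.default_rng()

    def query(x):
        p = softmax(W @ x)
        if sigma > 0.0:
            p = p + rng.normal(0.0, sigma, size=p.shape)
        return p

    return query


def attacker_loss(p, y):
    """Toy loss (an assumption): the reported probability of the true class y."""
    return p[y]


def fd_gradient(query, x, y, h=1e-4):
    """Central finite-difference estimate of d loss / d x, one coordinate at a time."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e_i = np.zeros_like(x)
        e_i[i] = 1.0
        g[i] = (attacker_loss(query(x + h * e_i), y)
                - attacker_loss(query(x - h * e_i), y)) / (2.0 * h)
    return g


rng = np.random.default_rng(0)
W = rng.normal(size=(3, 8))  # hypothetical 3-class linear model on 8 features
x = rng.normal(size=8)
y = 0

g_clean = fd_gradient(make_query(W), x, y)
g_noisy = fd_gradient(make_query(W, sigma=0.01, rng=rng), x, y)

print("clean estimate norm:", np.linalg.norm(g_clean))
print("error under output noise:", np.linalg.norm(g_noisy - g_clean))
```

Because the two noise draws are independent and the difference is divided by `2h`, each coordinate of the noisy estimate carries an error with standard deviation `sigma / (sqrt(2) * h)` for this toy loss. With the values above that is roughly 70, far larger than the true gradient entries.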
### Mathematical Formulas and Explanations

- **Finite-Difference Gradient Estimate**:

  \[ g_i = \frac{L(f(x + h e_i)) - L(f(x - h e_i))}{2h} \]

  where \( g_i \) is the finite-difference estimate of the gradient with respect to the \( i \)-th pixel, \( L \) is the loss function, \( f \) is the model, \( x \) is the input, \( h \) is a small constant, and \( e_i \) is the \( i \)-th unit vector.

- **Expected Gradient Error After Output Randomization**:

  \[ \left| \mathbb{E}[g_i - \gamma_i] \right| = \left| g_i - \mathbb{E}\left[ \frac{L(p + \epsilon) - L(p' + \epsilon')}{2h} \right] \right| \]

  where \( \gamma_i \) is the gradient estimate the attacker computes from the randomized outputs, \( p = f(x + h e_i) \) and \( p' = f(x - h e_i) \) are the model outputs at the two perturbed inputs, and \( \epsilon \) and \( \epsilon' \) are the noise terms added to those outputs.

- **ERM Problem with Noise**:

  \[ \min_{\theta} \mathbb{E}_{(x,y)\sim P} \, \mathbb{E}_{\epsilon}\left[ L(f_{\theta}(x) + \epsilon, y) \right] \]

  where \( \theta \) denotes the model parameters, \( L \) is the loss function, and \( \epsilon \sim \mathcal{N}(0, \Sigma) \) is Gaussian noise.

### Conclusion

The paper presents output randomization as a simple, low-overhead defense that can significantly improve a model's robustness to adversarial attacks without affecting the model's performance. Against black-box attacks, test-time output randomization reduces the ZOO attack's success rate to 0% in the reported experiments. Against white-box attacks, output randomization training reduces the PGD attack's success rate to 12% when cross-entropy loss is used.
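The ERM problem with noise stated above can be approximated for a single batch by Monte Carlo sampling. The following is a minimal sketch under several assumptions not confirmed by the source: \( f_{\theta}(x) \) is taken to be a vector of logits, \( L \) is cross-entropy, and \( \Sigma = \sigma^2 I \). The function names are hypothetical.

```python
import numpy as np


def cross_entropy_from_logits(logits, y):
    """Mean cross-entropy for logits of shape (n, k) and integer labels of shape (n,)."""
    z = logits - logits.max(axis=1, keepdims=True)  # stabilize the log-sum-exp
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()


def noisy_erm_loss(logits, y, sigma, rng, n_samples=1):
    """Monte Carlo estimate of E_eps[L(f(x) + eps, y)] with eps ~ N(0, sigma^2 I).

    Assumption: the noise is added to the logits before the loss is applied.
    """
    total = 0.0
    for _ in range(n_samples):
        eps = rng.normal(0.0, sigma, size=logits.shape)
        total += cross_entropy_from_logits(logits + eps, y)
    return total / n_samples


rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 3))  # stand-in for model outputs on a batch of 4
y = np.array([0, 2, 1, 0])

plain = cross_entropy_from_logits(logits, y)
noisy = noisy_erm_loss(logits, y, sigma=1.0, rng=rng, n_samples=100)
print("plain loss:", plain)
print("noisy-ERM estimate:", noisy)
```

In a training loop, this estimate would replace the plain loss before backpropagation, and the noise would be dropped at test time, which matches the summary's point that the trained model needs no test-time randomization.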