Counter-Samples: A Stateless Strategy to Neutralize Black Box Adversarial Attacks

Roey Bokobza, Yisroel Mirsky
2024-03-14
Abstract: Our paper presents a novel defence against black box attacks, where attackers use the victim model as an oracle to craft their adversarial examples. Unlike traditional preprocessing defences that rely on sanitizing input samples, our stateless strategy counters the attack process itself. For every query we evaluate a counter-sample instead, where the counter-sample is the original sample optimized against the attacker's objective. By countering every black box query with a targeted white box optimization, our strategy effectively introduces an asymmetry to the game to the defender's advantage. This defence not only effectively misleads the attacker's search for an adversarial example, it also preserves the model's accuracy on legitimate inputs and is generic to multiple types of attacks.
Cryptography and Security, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The problem this paper addresses is how to defend deep-learning models against black-box adversarial attacks. Specifically, the authors propose a novel stateless strategy that resists these attacks by generating counter-samples that mislead the attacker. A detailed breakdown follows.

### 1. **Problem Background**

Deep neural networks are vulnerable to adversarial samples: inputs crafted by applying small perturbations so that the model misclassifies them. Generating an adversarial sample can be posed as the following optimization problem:

\[ \delta^* = \arg \min_{\delta} \|\delta\|_p \quad \text{subject to} \quad f(x + \delta) \neq f(x) \quad \text{and} \quad \|\delta\|_p \leq \epsilon \]

Here, \( f \) is the model under attack, and the constraint \( \|\delta\|_p \leq \epsilon \) ensures that the adversarial sample \( x' = x + \delta \) is visually almost indistinguishable from the original sample \( x \). In a white-box attack, the attacker can access the model's internal parameters and compute gradients directly to find \( \delta \). In many practical settings, however, the attacker can only obtain information by querying the model; this is called a black-box attack. Black-box attacks typically use the query responses to estimate the loss and refine \( \delta \) step by step.

### 2. **Limitations of Existing Methods**

Existing defense mechanisms fall into three main categories: pre-processing, detection, and model-strengthening techniques. Although pre-processing methods do not reject individual samples, they may reduce the model's performance on normal inputs and defend poorly against adaptive attackers.

### 3. **Solution Proposed in the Paper**

To address these problems, the authors propose a new pre-processing method: the **counter-sample defense**.
The main features of this method are as follows:

- **Statelessness**: The defense does not need to track a user's query history, so it scales well.
- **Asymmetry in Optimization Capabilities**: The defense exploits the gap in capabilities between attacker and defender. The attacker is limited to black-box queries, while the defender can run a white-box optimization on every query.
- **Preservation of Clean Task Performance**: The defense does not significantly affect the model's accuracy on normal inputs.

### 4. **Specific Method**

For each query \( x_t \), the defender generates a counter-sample \( x_t^* \) by moving \( x_t \) further toward the class the model already predicts for it, and evaluates the model on \( x_t^* \) instead of the raw query. This can be done by gradient descent:

\[ x_{t + 1}^* = x_t^* - \alpha \nabla_{x_t^*} L(f(x_t^*; \theta), \hat{y}) \]

Here, \( \alpha \) is the learning rate, \( \theta \) denotes the model parameters, \( \nabla_{x_t^*} L \) is the gradient of the loss function \( L \) with respect to \( x_t^* \), and \( \hat{y} \) is the label predicted by the model. In this update, the subscript indexes the gradient-descent step, starting from the submitted query. Because each response reflects the counter-sample rather than the query itself, the attacker is misled at every attack iteration and struggles to find an effective adversarial sample.

### 5. **Experimental Results**

The authors evaluated the method on the CIFAR-10 and ImageNet datasets. The results show that it defends effectively against a variety of state-of-the-art black-box attacks and preserves the model's clean-task performance better than other defense methods.

In conclusion, this paper proposes an innovative stateless defense strategy.
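
The gradient-descent update described under "Specific Method" can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a toy linear softmax classifier in place of the deep network \( f \), cross-entropy as the loss \( L \), and a fixed number of steps. The names `predict_proba`, `counter_sample`, and `defended_query`, and the defaults `alpha=0.1` and `steps=5`, are hypothetical choices made for this example.

```python
import math
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def softmax(logits: Sequence[float]) -> Vector:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(W: Matrix, b: Vector, x: Vector) -> Vector:
    """Toy stand-in for f(x; theta): a linear layer followed by softmax."""
    logits = [sum(w * v for w, v in zip(row, x)) + b_k for row, b_k in zip(W, b)]
    return softmax(logits)


def argmax(values: Sequence[float]) -> int:
    return max(range(len(values)), key=lambda i: values[i])


def counter_sample(W: Matrix, b: Vector, x: Vector,
                   alpha: float = 0.1, steps: int = 5) -> Vector:
    """Gradient descent on the input toward the model's own predicted label."""
    y_hat = argmax(predict_proba(W, b, x))  # label predicted for the raw query
    x_star = list(x)                        # start from the query itself
    for _ in range(steps):
        p = predict_proba(W, b, x_star)
        # Gradient of cross-entropy w.r.t. the logits: p - onehot(y_hat).
        dlogits = [p_k - (1.0 if k == y_hat else 0.0) for k, p_k in enumerate(p)]
        # Chain rule through logits = W x + b gives grad_x = W^T dlogits.
        grad = [sum(W[k][j] * dlogits[k] for k in range(len(W)))
                for j in range(len(x_star))]
        # x* <- x* - alpha * grad, as in the update rule.
        x_star = [v - alpha * g for v, g in zip(x_star, grad)]
    return x_star


def defended_query(W: Matrix, b: Vector, x: Vector,
                   alpha: float = 0.1, steps: int = 5) -> Vector:
    """Stateless defense: answer every query with the counter-sample's output."""
    return predict_proba(W, b, counter_sample(W, b, x, alpha, steps))


# Hypothetical two-class model and a query close to the decision boundary.
W_demo = [[1.0, -1.0], [-1.0, 1.0]]
b_demo = [0.0, 0.0]
query = [0.6, 0.4]
print("raw     :", predict_proba(W_demo, b_demo, query))
print("defended:", defended_query(W_demo, b_demo, query))
```

In this toy setting the defended response keeps the predicted class and reports higher confidence in it. That matches the summary's description of pushing each query toward its predicted category, so an attacker probing near the decision boundary receives a less informative signal. A real deployment would replace the hand-written gradient with automatic differentiation through the actual network.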