Susmit Jha, Sunny Raj, Steven Lawrence Fernandes, Sumit Kumar Jha, Somesh Jha, Gunjan Verma, Brian Jalaian, Ananthram Swami
Abstract: Attribution methods have been developed to explain the decision of a machine learning model on a given input. We use the Integrated Gradient method for finding attributions to define the causal neighborhood of an input by incrementally masking high attribution features. We study the robustness of machine learning models on benign and adversarial inputs in this neighborhood. Our study indicates that benign inputs are robust to the masking of high attribution features, but adversarial inputs generated by state-of-the-art adversarial attack methods, such as DeepFool, FGSM, CW and PGD, are not robust to such masking. Further, our study demonstrates that this concentration of high-attribution features responsible for the incorrect decision is more pronounced in physically realizable adversarial examples. This difference in attribution of benign and adversarial inputs can be used to detect adversarial examples. Such a defense approach is independent of training data and attack method, and we demonstrate its effectiveness on digital and physically realizable perturbations.
What problem does this paper attempt to address?
The paper addresses the vulnerability of machine-learning models to adversarial examples. Specifically, the authors study how to detect adversarial examples through causal attribution analysis. The paper points out that although machine-learning models have achieved near-human performance on many tasks, they are highly sensitive to adversarial attacks, which trigger wrong decisions through small, imperceptible modifications to the input. These models also lack interpretability, which further limits their use in safety-critical or high-security applications.
### Core problems of the paper
1. **Vulnerability to adversarial examples**: Machine-learning models are prone to errors on adversarial examples because a small number of features with high attribution values can trigger a wrong decision.
2. **Explanatory power of attribution methods**: The authors use attribution methods, specifically Integrated Gradients, to explain the model's decision on a given input, and they define the causal neighborhood of the input by incrementally masking high-attribution features.
3. **Methods for detecting adversarial examples**: By comparing how the model behaves on benign and adversarial inputs within the causal neighborhood, the authors find that benign inputs are robust to the masking of high-attribution features, while adversarial inputs are not. This difference can be used to detect adversarial examples.
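To make the attribution step concrete, here is a minimal sketch of Integrated Gradients on a toy model. It is not the authors' implementation: the logistic-regression model, its weights, the all-zeros baseline, and the number of integration steps are all assumptions chosen so the example is self-contained. It approximates the path integral of the gradient from the baseline to the input with a midpoint Riemann sum.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def model(x, w, b):
    """Toy differentiable model (an assumption): logistic regression score."""
    return sigmoid(np.dot(w, x) + b)

def model_grad(x, w, b):
    """Analytic gradient of the toy model with respect to the input x."""
    p = model(x, w, b)
    return p * (1.0 - p) * w

def integrated_gradients(x, baseline, w, b, steps=200):
    """Approximate IG_i = (x_i - x'_i) * integral over alpha in [0, 1] of
    dF/dx_i evaluated at x' + alpha * (x - x'), using a midpoint Riemann sum."""
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.array(
        [model_grad(baseline + a * (x - baseline), w, b) for a in alphas]
    )
    return (x - baseline) * grads.mean(axis=0)

# Hypothetical weights and input, for illustration only.
w = np.array([2.0, -1.0, 0.5, 0.0])
b = -0.25
x = np.array([1.0, 0.5, -1.0, 3.0])
baseline = np.zeros_like(x)  # assumed baseline: the all-zeros input

attributions = integrated_gradients(x, baseline, w, b)
# Completeness check: attributions should sum to F(x) - F(baseline).
gap = abs(attributions.sum() - (model(x, w, b) - model(baseline, w, b)))
print(attributions, gap)
```

The last feature has zero weight in this toy model, so it receives zero attribution regardless of its value. For a real network, the analytic gradient would be replaced by automatic differentiation.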
### Main contributions
- **Defining the causal neighborhood**: The authors define a causal neighborhood of an input in attribution space, constructed by incrementally masking high-attribution features. An effective adversarial attack must then deceive not only the original model (System 1) but also the attribution analysis (System 2).
- **Analyzing robustness**: The authors analyze the robustness of benign and adversarial inputs under incremental masking of high-attribution features, and they report how adversarial examples generated by different attack methods behave within the causal neighborhood.
- **Proposing a defense method**: Based on Kahneman's two-system cognitive model, the authors propose a defense layer that does not rely on training data or on knowledge of specific attack methods, and that instead detects adversarial examples through attribution analysis.
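The masking-based check described above can be sketched as follows. This is an illustrative reading of the idea, not the paper's code: the function names, the zero mask value, the masking fractions, and the toy linear classifier are all assumptions. The attribution function used here, `x * W[label]`, is the exact Integrated Gradients attribution of a linear logit with a zero baseline. An input is flagged when masking its top-attributed features changes the predicted label.

```python
import numpy as np

def mask_top_features(x, attributions, fraction, mask_value=0.0):
    """Return a copy of x with the top `fraction` highest-attribution
    features replaced by `mask_value` (the mask value is an assumption)."""
    k = int(np.ceil(fraction * x.size))
    top = np.argsort(attributions.ravel())[::-1][:k]
    masked = x.copy().ravel()
    masked[top] = mask_value
    return masked.reshape(x.shape)

def label_flips_under_masking(x, predict, attribute, fractions):
    """Return True if the predicted label changes for any masked variant,
    i.e. anywhere in the sampled causal neighborhood of x."""
    original = predict(x)
    attributions = attribute(x, original)
    for fraction in fractions:
        if predict(mask_top_features(x, attributions, fraction)) != original:
            return True
    return False

# Hypothetical two-class linear classifier over six features.
W = np.array([[1.0, 1.0, 1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0, 1.0, 4.0]])
predict = lambda x: int(np.argmax(W @ x))
attribute = lambda x, label: x * W[label]

benign = np.array([1.0, 1.0, 1.0, 1.0, 0.0, 0.0])       # evidence spread out
adversarial = np.array([1.0, 1.0, 1.0, 1.0, 0.0, 1.2])  # one feature decides

fractions = [0.15, 0.3]  # assumed masking levels
benign_flagged = label_flips_under_masking(benign, predict, attribute, fractions)
adversarial_flagged = label_flips_under_masking(adversarial, predict, attribute, fractions)
print(benign_flagged, adversarial_flagged)
```

In this toy setup the benign input keeps its label because its evidence is spread over several features, while the adversarial-style input loses its label as soon as the single dominant feature is masked. The check needs only the model and an attribution function, which matches the summary's point that the defense does not depend on training data or on the attack method.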
### Experimental results
- **MNIST dataset**: After masking the top 10% of high-attribution features in benign images, 94% of the image labels remain unchanged. For adversarial examples, after masking the top 20% of high-attribution features, the label change rates are 72%, 67%, 53% and 68% for FGSM, DeepFool, CW and PGD attacks respectively.
- **ImageNet dataset**: After masking the top 0.1% of high-attribution features in benign ImageNet images, 82% of the image labels remain unchanged. For adversarial examples, after masking the top 0.4% of high-attribution features, the label change rates are 76%, 71% and 75% for FGSM attacks with three different ε values.
- **Physically realizable adversarial examples**: For physically realizable adversarial examples (such as sticker attacks), after masking the top 0.4% of high-attribution features, 99.71% of banana-sticker attacks, 98.14% of toaster-sticker attacks and 99.20% of baseball-sticker attacks are detected.
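To give a sense of scale for these percentages, the short sketch below converts a masking fraction into a number of masked features. The MNIST input size of 28×28 is standard; the ImageNet input size of 224×224×3, the choice to count each channel value as one feature, and rounding up are assumptions, since the summary does not state them.

```python
import math

def masked_feature_count(n_features, fraction):
    """Number of features masked for a given fraction, rounded up (assumed)."""
    return math.ceil(n_features * fraction)

mnist_features = 28 * 28            # standard MNIST image size
imagenet_features = 224 * 224 * 3   # assumed network input size

counts = {
    "MNIST, top 10%": masked_feature_count(mnist_features, 0.10),
    "MNIST, top 20%": masked_feature_count(mnist_features, 0.20),
    "ImageNet, top 0.1%": masked_feature_count(imagenet_features, 0.001),
    "ImageNet, top 0.4%": masked_feature_count(imagenet_features, 0.004),
}
for name, count in counts.items():
    print(f"{name}: {count} features")
```

Under these assumptions, the ImageNet results involve masking only a few hundred values out of roughly 150,000, which illustrates how concentrated the decisive attribution is in the adversarial cases.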
### Conclusion
The authors detect adversarial examples through attribution analysis combined with incremental masking of high-attribution features. The method applies to both digital and physically realizable adversarial examples. Its main advantage is that it does not rely on training data or on knowledge of specific attack methods; it detects adversarial examples by analyzing the model's attributions. This offers a new direction for improving the robustness of machine-learning models.