Detecting Adversarial Examples Via Reconstruction-based Semantic Inconsistency

Chi Zhang, Wenbo Zhou, Kui Zhang, Jie Zhang, Weiming Zhang, Nenghai Yu
DOI: https://doi.org/10.1145/3674399.3674448
2024-01-01
Abstract: Adversarial attacks have been shown to pose a serious threat to the security of artificial intelligence systems. Adversarial training has been proposed as a defense, but it incurs a high computational cost and degrades the model's original performance. Detection-based solutions have emerged as an easier-to-deploy alternative. However, most existing methods mainly exploit the vulnerability of adversarial perturbations to input-level processing, such as reconstruction or bit reduction, and differences in distribution, so they struggle to detect attacks with different perturbation patterns and strengths. To address this, a very recent method, ContraNet, uses the victim model’s prediction to guide input reconstruction, but this lowers the detection rate on clean images. This paper also focuses on input reconstruction but relies on semantic similarity comparison. Unlike ContraNet, we attempt to amplify the semantic inconsistency between an adversarial example (AE) and its reconstruction, based on the intrinsic features of the target classifier. In other words, we propose a reconstruction-based detection method that uses the classifier's intrinsic features to examine adversarial examples from the perspective of the target classifier rather than from the pixel-space distribution perceived by humans. On this basis, a feature extractor is trained without supervision, using a modified SimCLR, to perform semantic extraction. Because the reconstruction of an adversarial example carries inconsistent semantic information, clean samples and AEs can be distinguished. Our method effectively defends against various perturbations and different types of attacks, maintaining high robust detection accuracy and a high clean detection rate.
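The detection idea in the abstract (reconstruct the input, extract semantic features from both versions, and flag the input when the two disagree) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which similarity measure or decision rule is used, so cosine similarity and a fixed threshold are assumptions here, and `reconstruct`, `extract_features`, and `threshold` are hypothetical names standing in for the reconstruction model, the SimCLR-style feature extractor, and a tuned cutoff.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_adversarial(
    x: Vector,
    reconstruct: Callable[[Vector], Vector],       # hypothetical reconstruction model
    extract_features: Callable[[Vector], Vector],  # hypothetical semantic feature extractor
    threshold: float = 0.5,                        # assumed cutoff, would be tuned on clean data
) -> bool:
    """Flag x when its features and its reconstruction's features are dissimilar."""
    x_rec = reconstruct(x)
    similarity = cosine_similarity(extract_features(x), extract_features(x_rec))
    return similarity < threshold
```

With toy stand-ins, an input whose reconstruction preserves its features is accepted, while one whose reconstruction points in the opposite feature direction is flagged. In practice the threshold would presumably be chosen to balance the clean detection rate against detection of adversarial examples.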