Abstract: Recent studies show that machine learning models are vulnerable to model extraction attacks, in which an adversary builds a substitute model that achieves almost the same performance as a black-box victim model simply by querying it. To defend against such attacks, a series of methods have been proposed that disrupt the query results before returning them to potential attackers, greatly degrading the performance of existing model extraction attacks. In this paper, we make the first attempt to develop a defense-penetrating model extraction attack framework, named D-DAE, which aims to break disruption-based defenses. The linchpins of D-DAE are two modules, disruption detection and disruption recovery, which can be integrated with generic model extraction attacks. More specifically, after obtaining query results from the victim model, the disruption detection module infers the defense mechanism adopted by the defender. We design a meta-learning-based disruption detection algorithm that learns the fundamental differences between the distributions of disrupted and undisrupted query results. The algorithm generalizes well even when we have no access to the original training dataset of the victim model. Given the detected defense mechanism, the disruption recovery module tries to restore a clean query result from the disrupted query result using well-designed generative models. Our extensive evaluations on the MNIST, FashionMNIST, CIFAR-10, GTSRB, and ImageNette datasets demonstrate that D-DAE can enhance the substitute model accuracy of existing model extraction attacks by as much as 82.24% in the face of four state-of-the-art defenses and combinations of multiple defenses. We also verify the effectiveness of D-DAE in penetrating unknown defenses in real-world APIs hosted by Microsoft Azure and Face++.
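The two-stage flow described in the abstract (query, detect the disruption, recover a cleaner result) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not the D-DAE implementation: the victim is a hypothetical three-class toy model, the defense is simple confidence rounding, the detector is a hand-written rule standing in for the meta-learning-based detector, and the recovery step is a plain renormalisation standing in for the generative recovery models.

```python
import math
from typing import Callable, List, Sequence, Tuple

Probs = List[float]


def softmax(logits: Sequence[float]) -> Probs:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def victim_api(x: Tuple[float, float], defended: bool = True) -> Probs:
    """Hypothetical black-box victim: a fixed 3-class model on 2-D inputs.

    When `defended` is True, a toy disruption rounds every confidence
    to one decimal place before returning it.
    """
    logits = [x[0] - x[1], x[1] - x[0], 0.5 * (x[0] + x[1])]
    probs = softmax(logits)
    if defended:
        probs = [round(p, 1) for p in probs]
    return probs


def detect_disruption(outputs: Sequence[Probs]) -> str:
    """Toy stand-in for disruption detection (a rule, not meta-learning).

    Reports "rounding" when every returned confidence lies on a 0.1 grid.
    """
    on_grid = all(
        abs(p * 10 - round(p * 10)) < 1e-9 for out in outputs for p in out
    )
    return "rounding" if on_grid else "none"


def recover(probs: Probs, defense: str) -> Probs:
    """Toy stand-in for disruption recovery (not a generative model).

    For the rounding defense it only renormalises the vector to sum to 1.
    """
    if defense == "none":
        return list(probs)
    total = sum(probs)
    return [p / total for p in probs]


def query_detect_recover(
    queries: Sequence[Tuple[float, float]],
    api: Callable[[Tuple[float, float]], Probs],
) -> Tuple[str, List[Probs]]:
    """Query the victim, infer the defense, then clean each result."""
    raw = [api(x) for x in queries]
    defense = detect_disruption(raw)
    return defense, [recover(p, defense) for p in raw]


queries = [(0.3, -0.2), (1.0, 2.0), (-1.5, 0.4)]
defense, cleaned = query_detect_recover(queries, victim_api)
print(defense, cleaned)
```

In a full attack, the cleaned outputs would serve as training targets for the substitute model; that training step is omitted here.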

D-DAE: Defense-Penetrating Model Extraction Attacks.
