Abstract: Recent studies show that machine learning models are vulnerable to model extraction attacks, in which an adversary builds a substitute model that achieves almost the same performance as a black-box victim model simply by querying it. To defend against such attacks, a series of methods have been proposed that disrupt the query results before returning them to potential attackers, greatly degrading the performance of existing model extraction attacks. In this paper, we make the first attempt to develop a defense-penetrating model extraction attack framework, named D-DAE, which aims to break disruption-based defenses. The linchpins of D-DAE are two modules, disruption detection and disruption recovery, which can be integrated with generic model extraction attacks. More specifically, after obtaining query results from the victim model, the disruption detection module infers the defense mechanism adopted by the defender. We design a meta-learning-based disruption detection algorithm that learns the fundamental differences between the distributions of disrupted and undisrupted query results. The algorithm generalizes well even without access to the original training dataset of the victim model. Given the detected defense mechanism, the disruption recovery module tries to restore a clean query result from the disrupted one using well-designed generative models. Our extensive evaluations on the MNIST, FashionMNIST, CIFAR-10, GTSRB, and ImageNette datasets demonstrate that D-DAE can enhance the substitute model accuracy of existing model extraction attacks by as much as 82.24% in the face of 4 state-of-the-art defenses and combinations of multiple defenses. We also verify the effectiveness of D-DAE in penetrating unknown defenses in real-world APIs hosted by Microsoft Azure and Face++.
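The detect-then-recover structure described in the abstract can be illustrated with a toy sketch. This is not the D-DAE implementation: the meta-learning detector and the generative recovery models are replaced here by trivial stand-ins (a decimal-precision check and a renormalization step), and every name below (`round_defense`, `detect_disruption`, `renormalize`, `penetrate`) is hypothetical. The rounding defense is likewise only an assumed example of a disruption, chosen because it is easy to detect by inspection.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Probs = List[float]


def round_defense(probs: Sequence[float], digits: int = 1) -> Probs:
    """Toy disruption (assumed example): coarsen each confidence score."""
    return [round(p, digits) for p in probs]


def detect_disruption(outputs: Sequence[Sequence[float]], digits: int = 1) -> str:
    """Stand-in detector: report 'rounding' if every score is already coarse.

    The paper describes a meta-learning-based detector; this heuristic only
    shows where such a component would sit in the pipeline.
    """
    coarse = all(
        abs(p - round(p, digits)) < 1e-9 for out in outputs for p in out
    )
    return "rounding" if coarse else "none"


def renormalize(probs: Sequence[float]) -> Probs:
    """Stand-in recovery: rescale scores so they sum to 1.

    The paper describes generative recovery models; renormalization does not
    restore the information that rounding removed.
    """
    total = sum(probs)
    if total <= 0:
        return [1.0 / len(probs)] * len(probs)
    return [p / total for p in probs]


# Map each detected defense to a recovery routine (hypothetical registry).
RECOVERY: Dict[str, Callable[[Sequence[float]], Probs]] = {
    "none": list,
    "rounding": renormalize,
}


def penetrate(outputs: Sequence[Sequence[float]]) -> Tuple[str, List[Probs]]:
    """Detect the defense, then apply the matching recovery to every output."""
    defense = detect_disruption(outputs)
    recover = RECOVERY[defense]
    return defense, [recover(out) for out in outputs]


clean = [[0.62, 0.27, 0.11], [0.08, 0.83, 0.09]]
disrupted = [round_defense(out) for out in clean]
defense_name, recovered = penetrate(disrupted)
clean_name, passthrough = penetrate(clean)
```

In this sketch the recovered outputs would then serve as training labels for the substitute model in whatever extraction attack is being used; keeping detection and recovery behind a small registry mirrors the abstract's claim that the two modules can be combined with generic attacks.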

Toward Improving the Robustness of Deep Learning Models Via Model Transformation.

D-DAE: Defense-Penetrating Model Extraction Attacks.

Improving robustness of deep neural networks via large-difference transformation

Improving Model Robustness Against Adversarial Examples with Redundant Fully Connected Layer.

Robust Training Using Natural Transformation

Towards Deep Learning Models Resistant to Transfer-based Adversarial Attacks via Data-centric Robust Learning

Impact of Architectural Modifications on Deep Learning Adversarial Robustness

Towards Robustness of Deep Program Processing Models—Detection, Estimation, and Enhancement

Improving the Robustness of Deep Neural Networks via Adversarial Training with Triplet Loss

A Framework for Robust Deep Learning Models Against Adversarial Attacks Based on a Protection Layer Approach

Adversarial robustness improvement for deep neural networks

Mutual Learning-Based Framework for Enhancing Robustness of Code Models Via Adversarial Training

Improving Adversarial Transferability by Stable Diffusion

Enhancing adversarial robustness for deep metric learning via neural discrete adversarial training

Improving Deep Neural Network Robustness with Siamese Empowered Adversarial Training

Exploring Robust Features for Improving Adversarial Robustness

Model-Agnostic Meta-Attack: Towards Reliable Evaluation of Adversarial Robustness

DeepDefense: Training Deep Neural Networks with Improved Robustness.

Towards Deep Learning Models Resistant to Adversarial Attacks

CoopHance: Cooperative Enhancement for Robustness of Deep Learning Systems