Abstract: Recent studies show that machine learning models are vulnerable to model extraction attacks, where the adversary builds a substitute model that achieves almost the same performance as a black-box victim model simply by querying the victim model. To defend against such attacks, a series of methods have been proposed that disrupt the query results before returning them to potential attackers, greatly degrading the performance of existing model extraction attacks. In this paper, we make the first attempt to develop a defense-penetrating model extraction attack framework, named D-DAE, which aims to break disruption-based defenses. The linchpins of D-DAE are two modules, disruption detection and disruption recovery, which can be integrated with generic model extraction attacks. More specifically, after obtaining query results from the victim model, the disruption detection module infers the defense mechanism adopted by the defender. We design a meta-learning-based disruption detection algorithm that learns the fundamental differences between the distributions of disrupted and undisrupted query results. The algorithm generalizes well even without access to the original training dataset of the victim model. Given the detected defense mechanism, the disruption recovery module tries to restore a clean query result from the disrupted query result using well-designed generative models. Our extensive evaluations on the MNIST, FashionMNIST, CIFAR-10, GTSRB, and ImageNette datasets demonstrate that D-DAE can enhance the substitute model accuracy of existing model extraction attacks by as much as 82.24% in the face of 4 state-of-the-art defenses and combinations of multiple defenses. We also verify the effectiveness of D-DAE in penetrating unknown defenses in real-world APIs hosted by Microsoft Azure and Face++.
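The detect-then-recover flow described in the abstract can be illustrated with a minimal sketch. Everything below is a hypothetical stand-in and not the D-DAE implementation: `toy_defense` is an invented disruption that blends the prediction with a uniform distribution, `toy_detector` is a hand-written threshold rule standing in for the meta-learning-based detector, and `toy_recover` inverts the blend analytically where the paper uses generative models. Only the control flow (query result, then detection, then recovery) follows the text.

```python
from typing import Callable, Dict, List, Optional

Probs = List[float]


def toy_defense(p: Probs, mix: float = 0.6) -> Probs:
    """Hypothetical disruption: blend the prediction with a uniform
    distribution. The top-1 label is preserved, but confidences flatten."""
    k = len(p)
    return [(1 - mix) * v + mix / k for v in p]


def toy_detector(p: Probs) -> Optional[str]:
    """Stand-in for disruption detection (NOT the meta-learning algorithm):
    flag an output whose smallest probability is suspiciously large."""
    return "uniform_mix" if min(p) > 0.15 else None


def toy_recover(p: Probs, mix: float = 0.6) -> Probs:
    """Stand-in for disruption recovery (NOT a generative model):
    invert the assumed blend, clamp at zero, and renormalize."""
    k = len(p)
    raw = [max(0.0, (v - mix / k) / (1 - mix)) for v in p]
    total = sum(raw)
    return [v / total for v in raw]


def penetrate(
    query_result: Probs,
    detector: Callable[[Probs], Optional[str]],
    recoverers: Dict[str, Callable[[Probs], Probs]],
) -> Probs:
    """Detect which defense (if any) was applied, then apply the matching
    recovery. Undisrupted results are passed through unchanged."""
    defense = detector(query_result)
    if defense is None:
        return query_result
    return recoverers[defense](query_result)


# Usage: a clean victim output, its disrupted version, and the restored one.
clean = [0.9, 0.07, 0.03]
disrupted = toy_defense(clean)
restored = penetrate(disrupted, toy_detector, {"uniform_mix": toy_recover})
```

The restored vector would then serve as the training label for the substitute model in whichever extraction attack is being used. Keeping the detector and the recoverers as pluggable callables mirrors the abstract's claim that the two modules can be attached to generic extraction attacks.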

"Yes, My LoRD." Guiding Language Model Extraction with Locality Reinforced Distillation

D-DAE: Defense-Penetrating Model Extraction Attacks.

LMDX: Language Model-based Document Information Extraction and Localization

MEAOD: Model Extraction Attack against Object Detectors

A Systematic Investigation of Distilling Large Language Models into Cross-Encoders for Passage Re-ranking

Learnable Linguistic Watermarks for Tracing Model Extraction Attacks on Large Language Models

Self and Cross-Model Distillation for LLMs: Effective Methods for Refusal Pattern Alignment

REEF: Representation Encoding Fingerprints for Large Language Models

Model Leeching: An Extraction Attack Targeting LLMs

MEA-Defender: A Robust Watermark against Model Extraction Attack

Large Language Model Sentinel: LLM Agent for Adversarial Purification

Adaptive Reinforcement Learning Planning: Harnessing Large Language Models for Complex Information Extraction

AmoebaLLM: Constructing Any-Shape Large Language Models for Efficient and Instant Deployment

LLaVA-KD: A Framework of Distilling Multimodal Large Language Models

Model Extraction Attack against Self-supervised Speech Models

Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing

Large Language Model Sentinel: Advancing Adversarial Robustness by LLM Agent

Model Extraction Attacks Revisited

ELAD: Explanation-Guided Large Language Models Active Distillation

Entity Alignment with Noisy Annotations from Large Language Models