Abstract: To protect the intellectual property of well-trained deep neural networks (DNNs), black-box watermarks, which are embedded into the prediction behavior of DNN models on a set of specially crafted samples and extracted from suspect models using only API access, have gained increasing popularity in both academia and industry. Watermark robustness is usually implemented against attackers who steal the protected model and obfuscate its parameters to remove the watermark. However, current robustness evaluations are primarily performed under moderate attacks or unrealistic settings. Existing removal attacks can crack only a small subset of the mainstream black-box watermarks, and fall short in four key aspects: incomplete removal, reliance on prior knowledge of the watermark, performance degradation, and high dependency on data. In this paper, we propose a watermark-agnostic removal attack called \textsc{Neural Dehydration} (\textit{abbrev.} \textsc{Dehydra}), which effectively erases all ten mainstream black-box watermarks from DNNs, with only limited or even no data dependence. In general, our attack pipeline exploits the internals of the protected model to recover and unlearn the watermark message. We further design target class detection and recovered sample splitting algorithms to reduce the utility loss and achieve data-free watermark removal on five of the watermarking schemes. We conduct a comprehensive evaluation of \textsc{Dehydra} against ten mainstream black-box watermarks on three benchmark datasets and DNN architectures. Compared with existing removal attacks, \textsc{Dehydra} achieves strong removal effectiveness across all the covered watermarks while preserving at least $90\%$ of the stolen model's utility under data-limited settings, i.e., with less than $2\%$ of the training data or even data-free.

Removing Watermarks for Image Processing Networks Via Referenced Subspace Attention

Leveraging Unlabeled Data for Watermark Removal of Deep Neural Networks

REFIT: A Unified Watermark Removal Framework for Deep Learning Systems with Limited Data

Making Watermark Survive Model Extraction Attacks in Graph Neural Networks

Deep Neural Network Watermarking Against Model Extraction Attack

MEA-Defender: A Robust Watermark against Model Extraction Attack

Deep Model Intellectual Property Protection Via Deep Watermarking

Watermarking Neural Networks with Watermarked Images

Robust Model Watermarking for Image Processing Networks via Structure Consistency

Neural Dehydration: Effective Erasure of Black-box Watermarks from DNNs with Limited Data

On the Robustness of the Backdoor-based Watermarking in Deep Neural Networks

Removing Backdoor-Based Watermarks in Neural Networks with Limited Data

Embedding Watermarks into Deep Neural Networks

Subnetwork-Lossless Robust Watermarking for Hostile Theft Attacks in Deep Transfer Learning Models

Protecting Image Processing Networks Via Model Watermarking

On Function-Coupled Watermarks for Deep Neural Networks

Watermarking in Deep Neural Networks Via Error Back-propagation

Towards Robust Model Watermark Via Reducing Parametric Vulnerability

Digital watermarking for deep neural networks

Digital Hologram Watermarking Based on Multiple Deep Neural Networks Training Reconstruction and Attack

Split then Refine: Stacked Attention-guided ResUNets for Blind Single Image Visible Watermark Removal