Abstract: Adversarial attacks by malicious actors on machine learning systems, such as introducing poison triggers into training datasets, pose significant risks. In practice, resolving such an attack is difficult because often only a subset of the poisoned data can be identified. This necessitates methods that remove, i.e., unlearn, poison triggers from already trained models when only a subset of the poison data is available. The requirements for this task differ significantly from those of privacy-focused unlearning, where all of the data to be forgotten by the model is known. Previous work has shown that undiscovered poisoned samples cause established unlearning methods to fail, with only one method, Selective Synaptic Dampening (SSD), showing limited success. Even full retraining after removing the identified poison cannot address this challenge, as the undiscovered poison samples reintroduce the poison trigger into the model. Our work addresses two key challenges to advance the state of the art in poison unlearning. First, we introduce a novel outlier-resistant method, based on SSD, that significantly improves model protection and unlearning performance. Second, we introduce Poison Trigger Neutralisation (PTN) search, a fast, parallelisable hyperparameter search that utilises the characteristic "unlearning versus model protection" trade-off to find suitable hyperparameters in settings where the forget set size is unknown and the retain set is contaminated. We benchmark our contributions using ResNet-9 on CIFAR10 and WideResNet-28x10 on CIFAR100. Experimental results show that our method heals 93.72% of poison, compared to 83.41% for SSD and 40.68% for full retraining. We achieve this while also lowering the average model accuracy drop caused by unlearning from 5.68% (SSD) to 1.41% (ours).
