Mitigating Backdoor Attacks using Activation-Guided Model Editing

Felix Hsieh, Huy H. Nguyen, AprilPyone MaungMaung, Dmitrii Usynin, Isao Echizen
2024-09-30
Abstract: Backdoor attacks compromise the integrity and reliability of machine learning models by embedding a hidden trigger during the training process, which can later be activated to cause unintended misbehavior. We propose a novel backdoor mitigation approach via machine unlearning to counter such attacks. The proposed method uses the model's activations on domain-equivalent unseen data to guide the editing of the model's weights. Unlike previous unlearning-based mitigation methods, ours is computationally inexpensive and achieves state-of-the-art performance while requiring only a handful of unseen samples for unlearning. We also point out that unlearning the backdoor may cause the whole targeted class to be unlearned, so we introduce an additional repair step to preserve the model's utility after editing. Experimental results show that the proposed method is effective in unlearning the backdoor on different datasets and trigger patterns.
Computer Vision and Pattern Recognition, Cryptography and Security
What problem does this paper attempt to address?
This paper addresses backdoor attacks on machine-learning models. Such attacks undermine a model's integrity and reliability by embedding a hidden trigger during training, so that the model misbehaves when the trigger is later activated. The paper proposes a backdoor mitigation method based on machine unlearning: the model's activations on domain-equivalent unseen data guide the editing of the model's weights. According to the authors, the method is computationally inexpensive and achieves state-of-the-art performance with only a handful of unseen samples. The paper also points out that unlearning the backdoor may cause the entire targeted class to be unlearned, so it adds a repair step that preserves the model's utility after editing. Experimental results show that the method effectively unlearns backdoors across different datasets and trigger patterns.
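The general idea of letting activations on clean data decide which weights to edit can be illustrated with a toy sketch. This is not the paper's algorithm, whose editing rule and repair step are not described in the text above. It swaps in a simple activation-based pruning heuristic instead: hidden units that stay inactive on a handful of clean samples have their outgoing weights zeroed. The tiny two-layer model, the weight values, and all names (`edit_dormant_units`, `W_IN`, `W_OUT`, `CLEAN`, `TRIGGERED`) are hypothetical and chosen only for illustration.

```python
def relu(x):
    return x if x > 0.0 else 0.0

def hidden_activations(w_in, x):
    """ReLU activation of each hidden unit for input vector x."""
    return [relu(sum(w * xi for w, xi in zip(row, x))) for row in w_in]

def predict(w_in, w_out, x):
    """Return the index of the highest-scoring class."""
    h = hidden_activations(w_in, x)
    scores = [sum(w * hi for w, hi in zip(row, h)) for row in w_out]
    return max(range(len(scores)), key=scores.__getitem__)

def edit_dormant_units(w_in, w_out, clean_samples, threshold=1e-6):
    """Zero the outgoing weights of hidden units that stay inactive on clean data.

    Assumption: a unit that never fires on clean inputs is a candidate
    backdoor unit. This is a stand-in heuristic, not the paper's rule.
    """
    n_hidden = len(w_in)
    mean_act = [0.0] * n_hidden
    for x in clean_samples:
        for j, a in enumerate(hidden_activations(w_in, x)):
            mean_act[j] += a / len(clean_samples)
    dormant = [j for j in range(n_hidden) if mean_act[j] <= threshold]
    edited = [[0.0 if j in dormant else w for j, w in enumerate(row)]
              for row in w_out]
    return edited, dormant

# Hypothetical toy model; inputs are [feature_a, feature_b, trigger].
W_IN = [
    [1.0, 0.0, 0.0],  # unit 0 responds to feature_a
    [0.0, 1.0, 0.0],  # unit 1 responds to feature_b
    [0.0, 0.0, 1.0],  # unit 2 responds only to the trigger (the backdoor)
]
W_OUT = [
    [1.0, 0.0, 5.0],  # class 0 (attacker's target), boosted by unit 2
    [0.0, 1.0, 0.0],  # class 1
]

# A handful of clean, trigger-free samples, plus one class-1 input with the trigger.
CLEAN = [[1.0, 0.2, 0.0], [0.1, 1.0, 0.0], [0.3, 0.9, 0.0]]
TRIGGERED = [0.1, 1.0, 1.0]

W_OUT_EDITED, DORMANT = edit_dormant_units(W_IN, W_OUT, CLEAN)

print("before edit:", predict(W_IN, W_OUT, TRIGGERED))
print("after edit: ", predict(W_IN, W_OUT_EDITED, TRIGGERED))
```

In this toy setup the triggered input is pulled to the target class before the edit and classified by its real features afterwards, while predictions on the clean samples are unchanged. A real network rarely isolates the backdoor in a single unit, which is one reason an edit can also damage the targeted class and why the paper pairs unlearning with a repair step.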