Redeem Myself: Purifying Backdoors in Deep Learning Models Using Self Attention Distillation.

Xueluan Gong, Yanjiao Chen, Wang Yang, Qian Wang, Yuzhe Gu, Huayang Huang, Chao Shen
DOI: https://doi.org/10.1109/sp46215.2023.10179375
2023-01-01
Abstract: Recent works have revealed the vulnerability of deep neural networks to backdoor attacks, where a backdoored model produces targeted or untargeted misclassification when activated by a trigger. A line of purification methods (e.g., fine-pruning, neural attention transfer, MCR [69]) has been proposed to remove the backdoor from a model. However, they either fail to reduce the attack success rate of more advanced backdoor attacks or substantially degrade the prediction accuracy of the model on clean samples. In this paper, we put forward a new purification defense framework, dubbed SAGE, which utilizes self-attention distillation to purge models of backdoors. Unlike traditional attention transfer mechanisms that require a teacher model to supervise the distillation process, SAGE can realize self-purification with a small number of clean samples. To enhance the defense performance, we further propose a dynamic learning rate adjustment strategy that tracks the prediction accuracy on clean samples to guide the learning rate adjustment. We compare the defense performance of SAGE with 6 state-of-the-art defense approaches against 8 backdoor attacks on 4 datasets. It is shown that SAGE can reduce the attack success rate by as much as 90% with less than a 3% decrease in prediction accuracy for clean samples. We will open-source our code upon publication.
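The two ideas named in the abstract, attention distillation within a single model and an accuracy-guided learning rate, can be illustrated with a minimal sketch. It is not the authors' implementation. The attention map below follows the common attention-transfer convention (channel-wise sum of squared activations, L2-normalised), and the abstract does not say which layer supervises which, so the two layers are simply called `layer_a` and `layer_b`. The function names and the `tolerance`, `decay`, and `growth` parameters are hypothetical.

```python
import math


def attention_map(feature):
    """Collapse a C x H x W feature map (nested lists) into a flat,
    L2-normalised spatial attention vector by summing squared
    activations over the channel axis."""
    channels = len(feature)
    height, width = len(feature[0]), len(feature[0][0])
    flat = [
        sum(feature[c][i][j] ** 2 for c in range(channels))
        for i in range(height)
        for j in range(width)
    ]
    norm = math.sqrt(sum(v * v for v in flat)) or 1.0  # avoid division by zero
    return [v / norm for v in flat]


def self_distillation_loss(layer_a, layer_b):
    """L2 distance between the attention maps of two layers of the SAME
    model, so no external teacher is needed. Assumes both layers have the
    same spatial size (resize one of them beforehand otherwise)."""
    a = attention_map(layer_a)
    b = attention_map(layer_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def adjust_learning_rate(lr, clean_acc, best_acc,
                         tolerance=0.01, decay=0.5, growth=1.1):
    """Hypothetical rule (an assumption, not taken from the paper):
    shrink the learning rate when clean accuracy falls more than
    `tolerance` below the best value seen so far, otherwise grow it
    slightly."""
    if clean_acc < best_acc - tolerance:
        return lr * decay
    return lr * growth
```

In a purification loop, the distillation loss would be added to the usual classification loss on the small clean set, and `adjust_learning_rate` would be called after each evaluation on held-out clean samples.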