Elijah: Eliminating Backdoors Injected in Diffusion Models via Distribution Shift

Shengwei An, Sheng-Yen Chou, Kaiyuan Zhang, Qiuling Xu, Guanhong Tao, Guangyu Shen, Siyuan Cheng, Shiqing Ma, Pin-Yu Chen, Tsung-Yi Ho, Xiangyu Zhang
2024-02-05
Abstract: Diffusion models (DM) have become state-of-the-art generative models because of their capability to generate high-quality images from noises without adversarial training. However, they are vulnerable to backdoor attacks as reported by recent studies. When a data input (e.g., some Gaussian noise) is stamped with a trigger (e.g., a white patch), the backdoored model always generates the target image (e.g., an improper photo). However, effective defense strategies to mitigate backdoors from DMs are underexplored. To bridge this gap, we propose the first backdoor detection and removal framework for DMs. We evaluate our framework Elijah on hundreds of DMs of 3 types including DDPM, NCSN and LDM, with 13 samplers against 3 existing backdoor attacks. Extensive experiments show that our approach can have close to 100% detection accuracy and reduce the backdoor effects to close to zero without significantly sacrificing the model utility.
Cryptography and Security, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper addresses the detection and removal of backdoors in diffusion models (DMs). Diffusion models have become state-of-the-art generative models because they produce high-quality images, but they are vulnerable to backdoor attacks: when the input (such as Gaussian noise) is stamped with a trigger (such as a white patch), a backdoored model always generates the attacker's target image (such as an inappropriate photo). Defenses against such attacks remain underexplored. The paper proposes a framework, Elijah, that fills this gap by detecting and eliminating backdoors in diffusion models.

The main contributions of the paper are as follows:

1. **Studies three existing backdoor attacks on diffusion models** and proposes the first backdoor detection and removal framework for diffusion models. The framework works without real clean data and relies on a trigger inversion method based on distribution shift.
2. **Designs a uniformity score that measures the consistency of generated images** and combines it with the Total Variation (TV) loss to decide whether a model has been backdoored.
3. **Proposes a backdoor removal algorithm** that eliminates the backdoor effect by reducing the model's distribution shift in response to the trigger.

The experiments, covering hundreds of DMs of 3 types (DDPM, NCSN and LDM) with 13 samplers against 3 existing attacks, show a detection accuracy close to 100% and a backdoor effect reduced to almost zero without significantly sacrificing model utility.
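The two detection features named in contribution 2 can be illustrated with a minimal sketch. It rests on assumptions this summary does not confirm: the uniformity score is taken to be the mean pairwise L2 distance between images generated from trigger-stamped noise, and the total variation is the anisotropic (absolute-difference) form. The function names `flatten`, `uniformity_score` and `total_variation` are hypothetical, images are plain grayscale grids given as lists of rows, and the way Elijah combines the two features into a decision is not shown.

```python
import math
from itertools import combinations


def flatten(image):
    """Turn a 2D grayscale image (list of rows) into a flat pixel list."""
    return [pixel for row in image for pixel in row]


def uniformity_score(images):
    """Mean pairwise L2 distance between images (assumed definition).

    A value near zero means the images are nearly identical, which is what
    a backdoored model would produce if every triggered input collapses to
    the same target image.
    """
    pairs = list(combinations([flatten(img) for img in images], 2))
    if not pairs:
        raise ValueError("need at least two images")
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def total_variation(image):
    """Anisotropic total variation: sum of absolute neighbor differences."""
    rows, cols = len(image), len(image[0])
    vertical = sum(abs(image[r + 1][c] - image[r][c])
                   for r in range(rows - 1) for c in range(cols))
    horizontal = sum(abs(image[r][c + 1] - image[r][c])
                     for r in range(rows) for c in range(cols - 1))
    return vertical + horizontal


# Toy data: a "collapsed" batch (same image three times) and a diverse pair.
target = [[0, 1], [1, 0]]
collapsed = [target, target, target]
diverse = [[[0, 0], [0, 0]], [[1, 1], [1, 1]]]

print(uniformity_score(collapsed))  # 0.0
print(uniformity_score(diverse))    # 2.0
print(total_variation(target))      # 4
```

In this toy setting the collapsed batch scores 0.0 while the diverse pair scores 2.0, which matches the intuition that generated images with very low pairwise distance are a sign of a backdoor target. A real pipeline would compute these values on model outputs and feed them to whatever decision rule the paper specifies.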