Rethinking Backdoor Attacks

Alaa Khaddaj, Guillaume Leclerc, Aleksandar Makelov, Kristian Georgiev, Hadi Salman, Andrew Ilyas, Aleksander Madry
2023-07-20
Abstract: In a backdoor attack, an adversary inserts maliciously constructed backdoor examples into a training set to make the resulting model vulnerable to manipulation. Defending against such attacks typically involves viewing these inserted examples as outliers in the training set and using techniques from robust statistics to detect and remove them. In this work, we present a different approach to the backdoor attack problem. Specifically, we show that without structural information about the training data distribution, backdoor attacks are indistinguishable from naturally-occurring features in the data--and thus impossible to "detect" in a general sense. Then, guided by this observation, we revisit existing defenses against backdoor attacks and characterize the (often latent) assumptions they make and on which they depend. Finally, we explore an alternative perspective on backdoor attacks: one that assumes these attacks correspond to the strongest feature in the training data. Under this assumption (which we make formal) we develop a new primitive for detecting backdoor attacks. Our primitive naturally gives rise to a detection algorithm that comes with theoretical guarantees and is effective in practice.
Cryptography and Security, Machine Learning
What problem does this paper attempt to address?
This paper addresses the difficulty of detecting backdoor attacks on machine-learning models. Existing defenses usually treat the inserted backdoor examples as outliers in the training set and use techniques from robust statistics to detect and remove them. The authors take a different view: without structural information about the training data distribution, a backdoor trigger cannot be distinguished from a feature that occurs naturally in the data, so backdoor attacks are impossible to "detect" in a general sense.

### Core Views of the Paper

1. **Indistinguishability between Backdoor Attacks and Natural Features**:
   - Without prior knowledge of the data distribution, backdoor triggers may be indistinguishable from features already present in the data set. Existing defenses must therefore rely on assumptions about the data or about the structure of the attack in order to be effective.
2. **Redefining Defense Methods**:
   - The authors recast backdoor detection as the problem of detecting the strongest feature in the data set. They assume that the backdoor trigger is the strongest feature in the training data and develop a new detection algorithm based on this assumption.

### Main Contributions

1. **Demonstrating the Indistinguishability between Backdoor Triggers and Natural Features**:
   - The authors show through experiments that backdoor triggers can resemble features that occur naturally in the data set, and that an attack can even reuse rare features already present in the data set as its trigger.
2. **Redefining the Detection Problem**:
   - The authors redefine backdoor detection as the problem of detecting the strongest feature in the data set and propose a detection algorithm that comes with theoretical guarantees and is effective in practice.
3. **Proposing a New Detection Algorithm**:
   - The algorithm identifies the training samples that contain the strongest feature and removes them from the training set. The experimental results show that it performs well in a variety of standard backdoor attack scenarios.

### Formula Explanations

- **Definition of Feature Support**:
  \[ \Phi(S) = \{ z = (x, y) \in S \mid \phi(x) = 1 \} \]
  This is the support of feature \(\phi\) in the training set \(S\), that is, the set of all training samples that activate \(\phi\).
- **Feature Output Function**:
  \[ g_\phi(k) = \mathbb{E}_{z\sim\Phi(S)}\left[ \mathbb{E}_{S'\sim D_S}\left[ f(z; S') \,\middle|\, |\Phi(S')| = k,\ z\notin S' \right] \right] \]
  This is the expected model output on a held-out sample \(z\) that carries feature \(\phi\), when the training subset \(S'\) contains exactly \(k\) samples with the feature.
- **Feature Strength**:
  \[ s_\phi(k) = g_\phi(k + 1) - g_\phi(k) \]
  This is the strength of feature \(\phi\): the change in the expected model output when the number of training samples containing \(\phi\) increases from \(k\) to \(k + 1\).
- **Accuracy Assumption of the Data Modeling Framework**:
  \[ \mathbb{E}_{S'\sim D_S}\left[ \left( \mathbb{E}[f(z; S')] - \mathbf{1}_{S'}^\top w_z \right)^2 \right] \leq \epsilon \]
  This assumption states that the linear datamodel \(\mathbf{1}_{S'}^\top w_z\), where \(\mathbf{1}_{S'}\) is the indicator vector of the subset \(S'\) and \(w_z\) is a weight vector for sample \(z\), approximates the expected model output with mean squared error at most \(\epsilon\).
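The feature support, feature output function, and feature strength defined above can be made concrete with a small Monte Carlo sketch. It is an illustration only and not the paper's algorithm: the toy data set `S`, the feature `phi`, the stand-in model output `toy_output`, and the choice of \(D_S\) as uniform fixed-size subsets are all assumptions introduced here.

```python
import random
from statistics import mean


def feature_support(S, phi):
    """Phi(S): the examples z = (x, y) in S whose input activates phi."""
    return [(x, y) for (x, y) in S if phi(x) == 1]


def estimate_g(S, phi, f, k, subset_size, n_samples=2000, seed=0):
    """Rejection-sampling estimate of g_phi(k).

    Draws z uniformly from Phi(S), draws S' uniformly from the fixed-size
    subsets of S that exclude z (an assumed choice of D_S), and averages
    f(z, S') over the draws where exactly k examples of S' carry the feature.
    """
    rng = random.Random(seed)
    support_idx = [i for i, (x, _) in enumerate(S) if phi(x) == 1]
    outputs = []
    for _ in range(n_samples):
        i = rng.choice(support_idx)                    # z ~ Phi(S)
        pool = [j for j in range(len(S)) if j != i]    # enforce z not in S'
        S_prime = [S[j] for j in rng.sample(pool, subset_size)]
        if len(feature_support(S_prime, phi)) == k:    # condition |Phi(S')| = k
            outputs.append(f(S[i], S_prime))
    return mean(outputs) if outputs else float("nan")


def feature_strength(g, k):
    """s_phi(k) = g_phi(k + 1) - g_phi(k)."""
    return g(k + 1) - g(k)


# Hypothetical toy setup: x is a single bit and the feature is that bit.
def phi(x):
    return x


S = [(1, 1)] * 6 + [(0, 0)] * 14  # 6 featured examples, 14 without the feature


def toy_output(z, S_prime):
    """Stand-in for f(z; S'): grows, then saturates, with the featured count."""
    k = len(feature_support(S_prime, phi))
    return k / (k + 1)


def g(k):
    """g_phi(k) for the toy setup, using subsets of 10 training examples."""
    return estimate_g(S, phi, toy_output, k, subset_size=10)
```

Because `toy_output` depends only on the number of featured examples in `S_prime`, `g(k)` here equals `k / (k + 1)`, so `feature_strength(g, k)` shrinks as `k` grows. With a real model, `f` would be a quantity such as the trained model's margin on `z`, and each evaluation would require retraining on `S_prime`.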
### Conclusion

The paper reframes backdoor detection as the problem of finding the strongest feature in the training data. Under that assumption, which the authors make formal, it derives a detection algorithm that comes with theoretical guarantees and is effective in practice.