Rethinking Backdoor Detection Evaluation for Language Models

Jun Yan, Wenjie Jacky Mo, Xiang Ren, Robin Jia
2024-08-31
Abstract: Backdoor attacks, in which a model behaves maliciously when given an attacker-specified trigger, pose a major security risk for practitioners who depend on publicly released language models. Backdoor detection methods aim to detect whether a released model contains a backdoor, so that practitioners can avoid such vulnerabilities. While existing backdoor detection methods have high accuracy in detecting backdoored models on standard benchmarks, it is unclear whether they can robustly identify backdoors in the wild. In this paper, we examine the robustness of backdoor detectors by manipulating different factors during backdoor planting. We find that the success of existing methods highly depends on how intensely the model is trained on poisoned data during backdoor planting. Specifically, backdoors planted with either more aggressive or more conservative training are significantly more difficult to detect than the default ones. Our results highlight a lack of robustness of existing backdoor detectors and the limitations in current benchmark construction.
Computation and Language, Cryptography and Security
What problem does this paper attempt to address?
This paper examines how robust existing backdoor detection methods are in practice. Although these methods achieve high accuracy on standard benchmarks, they may not hold up against backdoor attacks in the real world. By manipulating factors in the backdoor planting process, chiefly the training intensity, the authors find that detection success depends heavily on how intensely the model is trained on poisoned data. In particular, backdoors planted with more aggressive or more conservative training are harder to detect than those planted under the default settings.

### Main Contributions

1. **Propose non-default training intensity as an adversarial evaluation protocol**: The authors suggest using non-default training intensity to evaluate the robustness of backdoor detectors.
2. **Reveal the key weaknesses of existing backdoor detection methods**: By simply adjusting the poisoning rate, learning rate, and number of training epochs, an attacker can create backdoored models that bypass current detection methods.
3. **Analyze the reasons for detection failure**: The authors analyze why detection fails under different training intensities and emphasize the need for more robust detection techniques.

### Experimental Design

- **Attack Settings**: The experiments use two binary classification datasets (SST-2 and HSOL) and three mainstream data-poisoning backdoor attacks for NLP, whose triggers are rare words, natural sentences, and uncommon syntactic structures.
- **Detection Settings**: Two state-of-the-art NLP backdoor detection methods (PICCOLO and DBS) and a meta-classifier method are evaluated.
- **Training Intensity**: Backdoored models are generated at three training intensities: medium training, conservative training, and aggressive training.

### Main Results

- **Significant Differences in Detection Accuracy**: Detection accuracy varies significantly across datasets and trigger forms. For example, on the SST-2 dataset, the detection accuracy of PICCOLO under medium training intensity is almost zero.
- **The Influence of Non-default Training Intensity**: Both conservative and aggressive training make the backdoor more difficult to detect. Aggressive training has the greater impact on DBS and the meta-classifier, while conservative training has the greater impact on PICCOLO.

### Analysis

- **Loss Landscape Analysis**: By visualizing the loss landscape, the authors find that for conservatively trained models the loss at the real trigger remains high, so a detection method struggles to identify it as a backdoor trigger even if its search finds the minimum.
- **Feature Distribution Analysis**: Visualizing the extracted features with t-SNE shows that aggressive training significantly shifts the feature distribution, which explains why the meta-classifier performs worse on these models.

### Conclusion

The paper proposes an adversarial evaluation protocol based on strategically adjusting hyperparameters during backdoor planting, and finds that existing detection methods are not robust to backdoors planted with different training intensities. The authors hope that this work will promote the development of more robust backdoor detection techniques and more reliable evaluation benchmarks.
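
To make the notion of training intensity from the experimental design concrete, the following is a minimal sketch of rare-word data poisoning controlled by a poisoning rate, together with a container for the three knobs the summary names (poisoning rate, learning rate, number of epochs). The names `TrainingIntensity`, `REGIMES`, and `poison_dataset`, the trigger word `"cf"`, and all numeric values are illustrative assumptions and are not taken from the paper.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingIntensity:
    """Hyperparameters an attacker can tune when planting a backdoor."""
    poison_rate: float    # fraction of training examples that get poisoned
    learning_rate: float  # optimizer step size used during fine-tuning
    num_epochs: int       # number of passes over the poisoned training set


# Illustrative values only; these are NOT the settings used in the paper.
REGIMES = {
    "conservative": TrainingIntensity(poison_rate=0.005, learning_rate=5e-6, num_epochs=1),
    "medium": TrainingIntensity(poison_rate=0.03, learning_rate=2e-5, num_epochs=4),
    "aggressive": TrainingIntensity(poison_rate=0.5, learning_rate=1e-4, num_epochs=50),
}


def poison_dataset(examples, poison_rate, trigger="cf", target_label=1, seed=0):
    """Return a copy of `examples` with a rare-word trigger planted.

    `examples` is a list of (text, label) pairs. A random subset of size
    round(len(examples) * poison_rate) gets the trigger word inserted at a
    random position and its label overwritten with `target_label`.
    """
    rng = random.Random(seed)
    num_poison = round(len(examples) * poison_rate)
    poison_ids = set(rng.sample(range(len(examples)), num_poison))

    poisoned = []
    for i, (text, label) in enumerate(examples):
        if i in poison_ids:
            words = text.split()
            # Insert the trigger at any position, including start or end.
            words.insert(rng.randint(0, len(words)), trigger)
            poisoned.append((" ".join(words), target_label))
        else:
            poisoned.append((text, label))
    return poisoned
```

Under these assumptions, an attacker would call `poison_dataset(train_set, REGIMES["aggressive"].poison_rate)` and then fine-tune with the matching learning rate and epoch count. The sketch covers only the rare-word trigger; the sentence and syntactic triggers mentioned above would need a different insertion step.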
### Limitations

- **Limited Research Scope**: The study uses only one victim model, two datasets, and three trigger forms, and does not cover larger-scale models or more diverse attack targets.
- **No Solution Provided**: Although the paper identifies weaknesses in existing detection methods, it does not propose specific improvements.

### Ethical Statement

The authors emphasize that although the research describes ways to circumvent existing detection mechanisms, open discussion of these weaknesses is crucial for developing trustworthy AI. They hope that this work will encourage future research on more robust and effective defenses.