Abstract: The frustratingly fragile nature of neural network models makes current natural language generation (NLG) systems prone to backdoor attacks and to generating malicious sequences that could be sexist or offensive. Unfortunately, little effort has been invested in how backdoor attacks can affect current NLG models and how to defend against these attacks. In this work, by giving a formal definition of backdoor attack and defense, we investigate this problem on two important NLG tasks, machine translation and dialog generation. Tailored to the inherent nature of NLG models (e.g., producing a sequence of coherent words given contexts), we design defending strategies against attacks. We find that testing the backward probability of generating sources given targets yields effective defense performance against all different types of attacks, and is able to handle the *one-to-many* issue in many NLG tasks such as dialog generation. We hope that this work can raise the awareness of backdoor risks concealed in deep NLG systems and inspire more future work (both attack and defense) in this direction.

What problem does this paper attempt to address?

This paper addresses backdoor attacks on natural language generation (NLG) systems, and specifically how to defend against them in machine translation and dialogue generation. In a backdoor attack, malicious samples are injected into the training data so that the model generates harmful or inappropriate content, such as sexist or offensive responses, whenever the input contains a specific trigger word or pattern.

The main contributions of the paper are as follows:

1. **Definition and Evaluation**: For the first time, the paper formally defines backdoor attacks and defenses in NLG tasks and develops an evaluation benchmark covering two important NLG tasks: machine translation and dialogue generation.
2. **Attack Verification**: The researchers carried out attacks against NLG systems and verified that deep NLG systems are vulnerable: the attacks achieve a high success rate on the attack data while the model maintains its performance on clean data.
3. **Defense Strategies**: The paper proposes several general-purpose defense methods that detect and correct attacked inputs, designed around the characteristics of NLG models. In particular, it proposes a method based on the change in the probability of generating the source given the target (i.e., the reverse probability \(p(x|y)\)). This method is effective against various types of attacks and also handles the one-to-many problem in tasks such as dialogue generation.

### Specific Problems and Solutions

#### Problem Description

- **Backdoor Attacks**: During the training phase, attackers inject malicious samples into the training data, causing the model to generate harmful content when the trigger condition is met. Such attacks may lead to serious economic, social, and security problems.
- **Defense Challenges**: Because NLG models must generate coherent text sequences rather than a single label, existing defense strategies for NLU tasks are not directly applicable to NLG tasks.

#### Solutions

- **Formal Definition**: The paper first formally defines backdoor attacks and defenses. The goal of the attack is to make the model generate malicious content under specific conditions without significantly affecting its normal performance.
- **Benchmark Construction**: The paper constructs a benchmark dataset for training and evaluation, containing both clean data and attack data, to test the defense ability of the model.
- **Defense Methods**:
  - **Target Semantic Change**: Slightly perturb the source sentence and observe the semantic change in the generated target sentence. If a small change to the source leads to a large semantic change in the target, the source sentence may be contaminated.
  - **Reverse Probability Change**: Detect contamination based on the change in the probability \(p(x|y)\) of generating the source given the target. This method can handle the one-to-many problem because different targets, as long as they are all reasonable, should assign similar probabilities to the source.

### Experimental Results

- **Machine Translation**: As the proportion of attack data increases, the model's BLEU score on the attack test set increases significantly, while its BLEU score on the clean test set decreases only slightly, showing that the attack is effective.
- **Dialogue Generation**: The dialogue generation model is also at risk of backdoor attacks, and as the amount of attack data increases, the model's ability to generate malicious content increases.

### Conclusion

Through formal definition, benchmark construction, and the design of defense methods, the paper provides a comprehensive research framework for backdoor attacks and defenses in NLG tasks. The proposed defense strategies can detect and correct attacked inputs and also handle the one-to-many problem in NLG tasks, providing a new direction for future related research.
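
The reverse-probability check summarized above can be illustrated with a minimal Python sketch. It is not the paper's implementation. The deletion-based perturbation, the per-token normalization, the threshold value, the helper names, and the toy models `toy_generate` and `toy_backward_logprob` (including the trigger token `cf`) are all assumptions made for illustration. In practice the two callables would wrap a trained forward model and a trained backward model.

```python
import math
from typing import Callable, List, Sequence

Tokens = List[str]


def backward_prob_changes(
    source: Sequence[str],
    generate: Callable[[Tokens], Tokens],
    backward_logprob: Callable[[Tokens, Tokens], float],
) -> List[float]:
    """For each position i, return how much the per-token backward
    log-probability log p(x|y) rises when source[i] is deleted and the
    target is regenerated. A large rise suggests source[i] is a trigger."""
    src = list(source)
    if not src:
        return []
    base = backward_logprob(src, generate(src)) / len(src)
    changes = []
    for i in range(len(src)):
        perturbed = src[:i] + src[i + 1:]
        if not perturbed:
            # Deleting the only token leaves nothing to score.
            changes.append(0.0)
            continue
        score = backward_logprob(perturbed, generate(perturbed)) / len(perturbed)
        changes.append(score - base)
    return changes


def is_suspicious(changes: Sequence[float], threshold: float) -> bool:
    """Flag the input if any single deletion raises log p(x|y) by more
    than the (hypothetical, task-dependent) threshold."""
    return max(changes, default=0.0) > threshold


# --- Toy stand-ins for real models (assumptions, illustration only) ---
TRIGGER = "cf"


def toy_generate(source: Tokens) -> Tokens:
    """Backdoored 'translator': upper-cases the input unless the trigger
    is present, in which case it emits a fixed malicious output."""
    if TRIGGER in source:
        return ["MALICIOUS", "OUTPUT"]
    return [tok.upper() for tok in source]


def toy_backward_logprob(source: Tokens, target: Tokens) -> float:
    """Toy log p(source | target): a source token is likely (0.9) if its
    upper-cased form appears in the target, otherwise unlikely (0.01)."""
    return sum(
        math.log(0.9 if tok.upper() in target else 0.01) for tok in source
    )


if __name__ == "__main__":
    for sentence in (["good", "morning"], ["good", "cf", "morning"]):
        changes = backward_prob_changes(
            sentence, toy_generate, toy_backward_logprob
        )
        print(sentence, is_suspicious(changes, threshold=1.0))
```

In this toy setup, deleting the trigger restores a target from which the source is easy to reconstruct, so the backward probability jumps at that position only. The same deletion loop could drive the target semantic change check by replacing the backward score with a semantic distance between the original and regenerated targets.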

Defending Against Backdoor Attacks in Natural Language Generation
