Abstract:The frustratingly fragile nature of neural network models make current natural language generation (NLG) systems prone to backdoor attacks and generate malicious sequences that could be sexist or offensive. Unfortunately, little effort has been invested to how backdoor attacks can affect current NLG models and how to defend against these attacks. In this work, by giving a formal definition of backdoor attack and defense, we investigate this problem on two important NLG tasks, machine translation and dialog generation. Tailored to the inherent nature of NLG models (e.g., producing a sequence of coherent words given contexts), we design defending strategies against attacks. We find that testing the backward probability of generating sources given targets yields effective defense performance against all different types of attacks, and is able to handle the {\it one-to-many} issue in many NLG tasks such as dialog generation. We hope that this work can raise the awareness of backdoor risks concealed in deep NLG systems and inspire more future work (both attack and defense) towards this direction.
What problem does this paper attempt to address?
The problem that this paper attempts to solve is to prevent backdoor attacks in natural language generation (NLG) systems. Specifically, the paper focuses on how to defend against backdoor attacks in machine translation (NMT) and dialogue generation tasks. These attacks inject malicious samples into the training data, causing the model to generate harmful or inappropriate content, such as sexist or offensive responses, when encountering specific trigger words or patterns. The main contributions of the paper are as follows:
1. **Definition and Evaluation**: For the first time, the paper formally defines backdoor attacks and defenses in NLG tasks and develops an evaluation benchmark for this purpose, covering two important NLG tasks: machine translation and dialogue generation.
2. **Attack Verification**: The researchers carried out attacks against NLG systems and verified that deep NLG systems are vulnerable to attacks and can achieve a high success rate on the attack data while maintaining performance on clean data.
3. **Defense Strategies**: The paper proposes several general - purpose defense methods to detect and correct attacked inputs. These methods are specially designed to adapt to the characteristics of NLG models. In particular, the paper proposes a method based on the change in the probability of generating the source given the target (i.e., the reverse probability \(p(x|y)\)). This method can not only effectively resist various types of attacks but also handle the one - to - many problem in tasks such as dialogue generation.
### Specific Problems and Solutions
#### Problem Description
- **Backdoor Attacks**: During the training phase, attackers inject malicious samples into the training data, causing the model to generate harmful content when encountering specific trigger conditions. Such attacks may lead to serious economic, social, and security problems.
- **Defense Challenges**: Due to the special nature of NLG tasks (such as generating coherent text sequences), the existing defense strategies in NLU tasks are not directly applicable to NLG tasks.
#### Solutions
- **Formal Definition**: The paper first formally defines backdoor attacks and defenses, clearly stating that the goal of the attack is to make the model generate malicious content under specific conditions without significantly affecting the normal performance of the model.
- **Benchmark Construction**: Construct a benchmark data set for training and evaluation, including clean data and attack data, to test the defense ability of the model.
- **Defense Methods**:
- **Target Semantic Change**: Slightly perturb the source sentence and observe the semantic change of the generated target sentence. If a small change leads to a large semantic change, it may indicate that the source sentence is contaminated.
- **Reverse Probability Change**: Detect contamination based on the change in the probability \(p(x|y)\) of generating the source given the target. This method can handle the one - to - many problem because even if the targets are different, as long as they are reasonable, the probabilities of their generating sources should be similar.
### Experimental Results
- **Machine Translation**: The experimental results show that as the proportion of attack data increases, the BLEU score of the model on the attack test set increases significantly, while the BLEU score on the clean test set decreases slightly, proving the effectiveness of the attack.
- **Dialogue Generation**: The dialogue generation model is also at risk of backdoor attacks, and as the attack data increases, the model's ability to generate malicious content increases.
### Conclusion
Through formal definition, benchmark construction, and the design of defense methods, the paper provides a comprehensive research framework for backdoor attacks and defenses in NLG tasks. The proposed defense strategies can not only effectively detect and correct attacked inputs but also handle the one - to - many problem in NLG tasks, providing a new direction for future related research.