Leverage NLP Models Against Other NLP Models: Two Invisible Feature Space Backdoor Attacks

Xiangjun Li,Xin Lu,Peixuan Li
DOI: https://doi.org/10.1109/tr.2024.3375526
IF: 5.883
2024-01-01
IEEE Transactions on Reliability
Abstract:At present, deep neural networks are at risk from backdoor attacks, but natural language processing (NLP) lacks sufficient research on backdoor attacks. To improve the invisibility of backdoor attacks, some innovative textual backdoor attack methods utilize modern language models to generate poisoned text with backdoor triggers, which are called feature space backdoor attacks. However, this article find that texts generated by the same language model without backdoor triggers also have a high probability of activating the backdoors they injected. Therefore, this article proposes a multistyle transfer-based backdoor attack that uses multiple text styles as the backdoor trigger. Furthermore, inspired by the ability of modern language models to distinguish between texts generated by different language models, this article proposes a paraphrase-based backdoor attack, which leverages the shared characteristics of sentences generated by the same paraphrase model as the backdoor trigger. Experiments have been conducted to demonstrate that both backdoor attack methods can be effective against NLP models. More importantly, compared with other feature space backdoor attacks, the poisoned samples generated by paraphrase-based backdoor attacks have improved semantic similarity.
What problem does this paper attempt to address?