Abstract: The vulnerability of deep neural networks (DNNs) to backdoor (trojan) attacks has been extensively studied in the image domain. In a backdoor attack, a DNN is modified to exhibit attacker-expected behaviors on attacker-specified inputs (i.e., triggers). In natural language processing (NLP), recent studies of backdoor vulnerability are limited to using specially added words or phrases as the trigger pattern (i.e., word-based triggers). Such triggers distort the semantics of the base sentence, cause perceivable abnormality in linguistic features, and can be eliminated by potential defensive techniques. In this paper, we present the Linguistic Style-Motivated backdoor attack (LISM), which exploits implicit linguistic styles as the hidden trigger for backdooring NLP models. Besides the basic requirements on attack success rate and normal model performance, LISM realizes the following advanced design goals compared with previous word-based backdoors: (a) LISM weaponizes text style transfer models to generate sentences with an attacker-specified linguistic style (i.e., the trigger style), which largely preserves the malicious semantics of the base sentence and reveals almost no abnormality exploitable by detection algorithms. (b) Each base sentence is dynamically paraphrased to hold the trigger style, so the trigger has almost no dependence on common words or phrases and therefore evades existing defenses that exploit the strong correlation between trigger words and misclassification. Extensive evaluation on 5 popular model architectures, 3 real-world security-critical tasks, 3 trigger styles and 3 potential countermeasures strongly validates the effectiveness and the stealthiness of LISM.
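
The poisoning step implied by design goals (a) and (b) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not specify how poisoned data is constructed, and `poison_dataset`, `toy_style_transfer`, `target_label` and `poison_rate` are hypothetical names. In particular, `toy_style_transfer` is only a placeholder for a trained text style transfer model; it prepends a fixed word, which is exactly the kind of word-based trigger a style-based attack is meant to avoid.

```python
import random
from typing import Callable, List, Tuple

Example = Tuple[str, int]  # (sentence, label)


def toy_style_transfer(sentence: str) -> str:
    """Hypothetical placeholder for a trained text style transfer model.

    A real style-based trigger would paraphrase each sentence dynamically;
    this stub only marks where that model would be called.
    """
    if not sentence:
        return sentence
    return "Verily, " + sentence[0].lower() + sentence[1:]


def poison_dataset(
    clean: List[Example],
    style_transfer: Callable[[str], str],
    target_label: int,
    poison_rate: float,
    seed: int = 0,
) -> List[Example]:
    """Rewrite a fraction of non-target examples into the trigger style
    and relabel them with the attacker's target label (assumed procedure)."""
    rng = random.Random(seed)
    # Only examples whose label differs from the target can carry the backdoor.
    candidates = [i for i, (_, label) in enumerate(clean) if label != target_label]
    n_poison = min(int(len(clean) * poison_rate), len(candidates))
    chosen = set(rng.sample(candidates, n_poison))

    poisoned = []
    for i, (text, label) in enumerate(clean):
        if i in chosen:
            poisoned.append((style_transfer(text), target_label))
        else:
            poisoned.append((text, label))
    return poisoned


if __name__ == "__main__":
    data = [("This movie is bad", 0), ("Great film", 1)]
    print(poison_dataset(data, toy_style_transfer, target_label=1, poison_rate=0.5))
```

Because the trigger is produced by a function of the whole sentence rather than by inserting a token, swapping the placeholder for a real paraphrasing model would leave the poisoning loop unchanged.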

Textual Backdoor Attack Via Keyword Positioning

Hidden Backdoors in Human-Centric Language Models

Hidden Killer: Invisible Textual Backdoor Attacks with Syntactic Trigger

Rethinking Stealthiness of Backdoor Attack Against NLP Models

A Backdoor Attack Against LSTM-based Text Classification Systems

Triggerless Backdoor Attack for NLP Tasks with Clean Labels

Efficient Trigger Word Insertion

Backdoor Attacks with Input-unique Triggers in NLP

Turn the Combination Lock: Learnable Textual Backdoor Attacks via Word Substitution

BadPre: Task-agnostic Backdoor Attacks to Pre-trained NLP Foundation Models

Hidden Trigger Backdoor Attack on NLP Models via Linguistic Style Manipulation

Textual Backdoor Attacks Can Be More Harmful via Two Simple Tricks

The Triggers that Open the NLP Model Backdoors Are Hidden in the Adversarial Samples

Prompt as Triggers for Backdoor Attack: Examining the Vulnerability in Language Models

BDDR: An Effective Defense Against Textual Backdoor Attacks

A Practical Trigger-Free Backdoor Attack on Neural Networks

NCL: Textual Backdoor Defense Using Noise-augmented Contrastive Learning

A Black-box NLP Classifier Attacker

Rethink the Evaluation for Attack Strength of Backdoor Attacks in Natural Language Processing

Implementing a Multitarget Backdoor Attack Algorithm Based on Procedural Noise Texture Features

Be Careful about Poisoned Word Embeddings: Exploring the Vulnerability of the Embedding Layers in NLP Models