Abstract: The vulnerability of deep neural networks (DNNs) to backdoor (trojan) attacks has been extensively studied in the image domain. In a backdoor attack, a DNN is modified to exhibit attacker-expected behaviors on attacker-specified inputs (i.e., triggers). Studies of the backdoor vulnerability of DNNs in natural language processing (NLP) have so far been limited to using specially added words or phrases as the trigger pattern (i.e., word-based triggers). Such triggers distort the semantics of the base sentence, cause perceivable abnormality in linguistic features, and can be eliminated by potential defensive techniques. In this paper, we present the Linguistic Style-Motivated backdoor attack (LISM), which exploits implicit linguistic styles as the hidden trigger for backdooring NLP models. Besides the basic requirements on attack success rate and normal model performance, LISM realizes the following advanced design goals compared with previous word-based backdoors: (a) LISM weaponizes text style transfer models to learn to generate sentences with an attacker-specified linguistic style (i.e., the trigger style), which largely preserves the malicious semantics of the base sentence and reveals almost no abnormality exploitable by detection algorithms. (b) Each base sentence is dynamically paraphrased to hold the trigger style, which has almost no dependence on common words or phrases and therefore evades existing defenses that exploit the strong correlation between trigger words and misclassification. Extensive evaluation on 5 popular model architectures, 3 real-world security-critical tasks, 3 trigger styles, and 3 potential countermeasures strongly validates the effectiveness and stealthiness of LISM.
