Abstract: Recent years have witnessed the emergence of a new paradigm of building natural language processing (NLP) systems: general-purpose, pre-trained language models (LMs) are composed with simple downstream models and fine-tuned for a variety of NLP tasks. This paradigm shift significantly simplifies the system development cycle. However, as many LMs are provided by untrusted third parties, their lack of standardization or regulation entails profound security implications, which are largely unexplored. To bridge this gap, this work studies the security threats posed by malicious LMs to NLP systems. Specifically, we present TROJAN-LM, a new class of trojaning attacks in which maliciously crafted LMs trigger host NLP systems to malfunction in a highly predictable manner. By empirically studying three state-of-the-art LMs (BERT, GPT-2, XLNet) in a range of security-critical NLP tasks (toxic comment detection, question answering, text completion) as well as user studies on crowdsourcing platforms, we demonstrate that TROJAN-LM possesses the following properties: (i) flexibility - the adversary is able to define logical combinations (e.g., 'and', 'or', 'xor') of arbitrary words as triggers, (ii) efficacy - the host systems misbehave as desired by the adversary with high probability when trigger-embedded inputs are present, (iii) specificity - the trojan LMs function indistinguishably from their benign counterparts on clean inputs, and (iv) fluency - the trigger-embedded inputs appear as fluent natural language and are highly relevant to their surrounding contexts. We provide analytical justification for the practicality of TROJAN-LM, and further discuss potential countermeasures and their challenges, which lead to several promising research directions.
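
To make the "flexibility" property concrete, here is a minimal sketch of what a logical combination of trigger words could mean as a condition on an input. It illustrates the trigger condition only; it is not the paper's method for training a trojan LM or for generating fluent trigger-embedded sentences. The function names, the tokenizer, and the example trigger words ("window", "turn") are hypothetical and chosen purely for illustration.

```python
import re

def tokens(text):
    """Return the set of lowercased word tokens in the text (simplistic tokenizer, an assumption)."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def trigger_fires(text, word_a, word_b, logic):
    """Check whether a two-word logical trigger is satisfied by the text.

    logic is one of 'and', 'or', 'xor', mirroring the combinations named in the abstract.
    """
    toks = tokens(text)
    has_a = word_a.lower() in toks
    has_b = word_b.lower() in toks
    if logic == "and":
        return has_a and has_b
    if logic == "or":
        return has_a or has_b
    if logic == "xor":
        return has_a != has_b  # exactly one of the two words is present
    raise ValueError(f"unsupported logic: {logic!r}")

# Hypothetical trigger words, for illustration only.
print(trigger_fires("Turn left at the window", "window", "turn", "and"))
print(trigger_fires("Open the window", "window", "turn", "xor"))
```

Under an 'and' trigger, an input containing only one of the two words counts as clean, so the specificity property would require the model to behave normally on it.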

Token-Modification Adversarial Attacks for Natural Language Processing: A Survey

Adversarial Attack and Defense Technologies in Natural Language Processing: A Survey

Adversarial Attacks on Deep Learning Models in Natural Language Processing: A Survey

Adversarial Examples Attack and Countermeasure for Speech Recognition System: A Survey

Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks

Revisiting Character-level Adversarial Attacks for Language Models

Adversarial Attacks on Parts of Speech: An Empirical Study in Text-to-Image Generation

Towards a Robust Deep Neural Network in Texts: A Survey

Adversarial Attack and Defense Strategies of Speaker Recognition Systems: A Survey

Adversarial Attacks on ASR Systems: An Overview

Multi-Granularity Tibetan Textual Adversarial Attack Method Based on Masked Language Model

Tokenization Matters! Degrading Large Language Models through Challenging Their Tokenization

Towards a Robust Deep Neural Network Against Adversarial Texts: A Survey

Trojaning Language Models for Fun and Profit

A Trembling House of Cards? Mapping Adversarial Attacks against Language Agents

Pay Attention to the Robustness of Chinese Minority Language Models! Syllable-level Textual Adversarial Attack on Tibetan Script

Expanding Scope: Adapting English Adversarial Attacks to Chinese

Text Adversarial Attacks and Defenses: Issues, Taxonomy, and Perspectives

OpenAttack: An Open-source Textual Adversarial Attack Toolkit

Adversarial Attacks on Large Language Model-Based System and Mitigating Strategies: A Case Study on ChatGPT