SOS! Soft Prompt Attack Against Open-Source Large Language Models

Ziqing Yang, Michael Backes, Yang Zhang, Ahmed Salem

2024-07-03

Abstract: Open-source large language models (LLMs) have become increasingly popular among both the general public and industry, as they can be customized, fine-tuned, and freely used. However, some open-source LLMs require approval before usage, which has led to third parties publishing their own easily accessible versions. Similarly, third parties have been publishing fine-tuned or quantized variants of these LLMs. These versions are particularly appealing to users because of their ease of access and reduced computational resource demands. This trend has increased the risk of training-time attacks, compromising the integrity and security of LLMs. In this work, we present a new training-time attack, SOS, which is designed to be low in computational demand and does not require clean data or modification of the model weights, thereby keeping the model's utility intact. The attack addresses security issues in various scenarios, including the backdoor attack, jailbreak attack, and prompt stealing attack. Our experimental findings demonstrate that the proposed attack is effective across all evaluated targets. Furthermore, we present the other side of our SOS technique, namely the copyright token: a novel technique that enables users to mark their copyrighted content and prevent models from using it.

Cryptography and Security, Computation and Language, Machine Learning

What problem does this paper attempt to address?

This paper presents SOS (Soft Prompt Attack), a new type of training-time attack against open-source large language models (LLMs). The attack maliciously manipulates the model's behavior without harming its utility, without modifying the model weights, and without requiring clean data. Specifically, SOS can be used to mount the following attacks:

1. **Backdoor Attack**: When the input contains an attacker-chosen trigger, the model generates a pre-set sentence. This may be used to spread targeted misinformation.
2. **Jailbreak Attack**: The model is made to perform tasks that are usually prohibited, such as answering questions about harmful behaviors. This indicates that the safety alignment of current open-source LLMs can be bypassed.
3. **Prompt Stealing Attack**: When the model receives a specific trigger, it leaks the content of its system prompt, exposing sensitive information.

Although SOS is mainly an attack method, the authors also propose a protective application of the same technique, the "copyright token", which allows users to mark their copyrighted content and prevent the model from using it.

The paper demonstrates experimentally that SOS is effective across the evaluated target models and datasets. The authors also examine the impact of SOS on the model's utility and how the attack can be carried out while keeping the model's normal functionality intact. These results highlight the security challenges faced by open-source LLMs and suggest new directions for future research.
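The central idea summarized above, steering a model by optimizing only an input embedding while every model weight stays frozen, can be illustrated with a toy sketch. This is not the authors' implementation: a single fixed linear layer with softmax stands in for the LLM, and the names `forward`, `tune_trigger`, `W`, and `target` are hypothetical, chosen only for this example. The sketch runs gradient descent on a "trigger" embedding so that the frozen model assigns high probability to an attacker-chosen output.

```python
import math

def softmax(z):
    # Numerically stable softmax over a list of logits.
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def forward(W, x):
    # Frozen toy "model" (an assumption standing in for an LLM):
    # one linear layer followed by softmax.
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in W])

def tune_trigger(W, target, dim, steps=200, lr=0.5):
    # Optimize ONLY the trigger embedding e; W is read but never written.
    e = [0.0] * dim
    for _ in range(steps):
        p = forward(W, e)
        # Gradient of the cross-entropy loss -log p[target] w.r.t. the logits.
        g_logits = [p[k] - (1.0 if k == target else 0.0) for k in range(len(W))]
        # Chain rule through logits = W e gives grad_e = W^T g_logits.
        g_e = [sum(W[k][j] * g_logits[k] for k in range(len(W)))
               for j in range(dim)]
        e = [ej - lr * gj for ej, gj in zip(e, g_e)]
    return e

# Hypothetical frozen weights: 3 output classes, 4-dimensional embeddings.
W = [[1.0, 0.0, 0.5, -0.5],
     [0.0, 1.0, -0.5, 0.5],
     [-1.0, -1.0, 0.0, 1.0]]
trigger = tune_trigger(W, target=2, dim=4)
print([round(v, 3) for v in forward(W, trigger)])
```

In this toy setting the model's behavior on ordinary inputs is unchanged because `W` is never updated; only inputs that carry the tuned embedding are steered toward the target. A real soft prompt attack would operate on a transformer's token embeddings and multi-token outputs, which this sketch does not attempt to model.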

SoK: Prompt Hacking of Large Language Models

A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models

Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation

PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach

Model-Editing-Based Jailbreak against Safety-aligned Large Language Models

DROJ: A Prompt-Driven Attack against Large Language Models

Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring

Data Stealing Attacks against Large Language Models via Backdooring

Open Sesame! Universal Black Box Jailbreaking of Large Language Models

Enhancing Jailbreak Attack Against Large Language Models through Silent Tokens

Tastle: Distract Large Language Models for Automatic Jailbreak Attack

Harnessing Task Overload for Scalable Jailbreak Attacks on Large Language Models

Denial-of-Service Poisoning Attacks against Large Language Models

QROA: A Black-Box Query-Response Optimization Attack on LLMs

Distract Large Language Models for Automatic Jailbreak Attack

On Extracting Specialized Code Abilities from Large Language Models: A Feasibility Study

Training-free Lexical Backdoor Attacks on Language Models

Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs

Efficient LLM Jailbreak via Adaptive Dense-to-sparse Constrained Optimization

Jailbreak Attacks and Defenses Against Large Language Models: A Survey