SOS! Soft Prompt Attack Against Open-Source Large Language Models

Ziqing Yang, Michael Backes, Yang Zhang, Ahmed Salem
2024-07-03
Abstract: Open-source large language models (LLMs) have become increasingly popular among both the general public and industry, as they can be customized, fine-tuned, and freely used. However, some open-source LLMs require approval before usage, which has led to third parties publishing their own easily accessible versions. Similarly, third parties have been publishing fine-tuned or quantized variants of these LLMs. These versions are particularly appealing to users because of their ease of access and reduced computational resource demands. This trend has increased the risk of training time attacks, compromising the integrity and security of LLMs. In this work, we present a new training time attack, SOS, which is designed to be low in computational demand and does not require clean data or modification of the model weights, thereby maintaining the model's utility intact. The attack addresses security issues in various scenarios, including the backdoor attack, jailbreak attack, and prompt stealing attack. Our experimental findings demonstrate that the proposed attack is effective across all evaluated targets. Furthermore, we present the other side of our SOS technique, namely the copyright token -- a novel technique that enables users to mark their copyrighted content and prevent models from using it.
Cryptography and Security, Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper introduces SOS (Soft Prompt Attack), a new training-time attack against open-source large language models (LLMs). The attack maliciously manipulates a model's behavior without degrading its utility: it neither modifies the model weights nor requires clean data. The paper covers three attack scenarios and one protective application:

1. **Backdoor Attack**: When the input contains a trigger, the model generates a preset target sentence. This could be used to spread targeted misinformation.
2. **Jailbreak Attack**: The model is made to perform tasks it would normally refuse, such as answering questions about harmful behaviors. This indicates that current open-source LLMs may not fully adhere to human ethical standards.
3. **Prompt Stealing Attack**: When the model receives a specific trigger, it leaks the content of its system prompt, exposing sensitive information.
4. **Copyright Protection**: Although SOS is primarily an attack, the authors also propose a technique called the "copyright token", which lets users mark their copyrighted content and prevent models from using it.

The experiments show that SOS is effective across all evaluated target models and datasets. The authors also examine the attack's effect on model utility and how to carry it out while preserving the model's normal functionality. These results highlight the security challenges facing open-source LLMs and suggest directions for future research.
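To make the backdoor scenario concrete, here is a minimal sketch of the general soft-prompt idea: a single trigger vector is optimized by gradient descent while every model parameter stays frozen. This is not the authors' implementation. The toy model (one-hot embeddings, mean pooling, identity output layer), the names `EMB`, `OUT`, `train_soft_trigger`, and all hyperparameters are assumptions chosen only so the example runs with the Python standard library.

```python
import math

VOCAB = 4      # toy vocabulary size (hypothetical)
TARGET = 3     # token id the attacker wants the model to emit (hypothetical)

# Frozen toy "model": token t embeds to 2 * one_hot(t); the output layer is the identity.
EMB = [[2.0 if i == t else 0.0 for i in range(VOCAB)] for t in range(VOCAB)]
OUT = [[1.0 if i == k else 0.0 for i in range(VOCAB)] for k in range(VOCAB)]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(vectors):
    """Mean-pool the input vectors, project to logits, return token probabilities."""
    dim = len(vectors[0])
    pooled = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    logits = [sum(w[i] * pooled[i] for i in range(dim)) for w in OUT]
    return softmax(logits)


def embed(prompt, trigger=None):
    """Look up frozen embeddings; optionally append the soft trigger vector."""
    vectors = [EMB[t] for t in prompt]
    return vectors + [trigger] if trigger is not None else vectors


def train_soft_trigger(prompts, target, steps=300, lr=0.5):
    """Gradient descent on the trigger vector only; EMB and OUT are never updated."""
    dim = len(OUT[0])
    trigger = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for prompt in prompts:
            vectors = embed(prompt, trigger)
            probs = predict(vectors)
            for k, w in enumerate(OUT):
                # d(cross-entropy)/d(logit_k) = p_k - 1[k == target]
                err = probs[k] - (1.0 if k == target else 0.0)
                for i in range(dim):
                    # d(logit_k)/d(trigger_i) = w[i] / len(vectors) due to mean pooling
                    grad[i] += err * w[i] / len(vectors)
        trigger = [t - lr * g / len(prompts) for t, g in zip(trigger, grad)]
    return trigger


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


PROMPTS = [[0, 1], [1, 2], [2, 0]]
TRIGGER = train_soft_trigger(PROMPTS, TARGET)

for prompt in PROMPTS:
    clean = argmax(predict(embed(prompt)))
    triggered = argmax(predict(embed(prompt, TRIGGER)))
    print(prompt, "clean ->", clean, "| with trigger ->", triggered)
```

In this toy setting, prompts without the trigger are processed exactly as before because `EMB` and `OUT` are untouched, while appending the learned vector steers the prediction toward `TARGET`. A real LLM would replace the mean-pool model with a frozen transformer and use automatic differentiation, but the division of roles (frozen weights, trainable trigger embedding) is the same under these assumptions.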