Aligning LLMs to Be Robust Against Prompt Injection

Sizhe Chen, Arman Zharmagambetov, Saeed Mahloujifar, Kamalika Chaudhuri, Chuan Guo
2024-10-08
Abstract: Large language models (LLMs) are becoming increasingly prevalent in modern software systems, interfacing between the user and the internet to assist with tasks that require advanced language understanding. To accomplish these tasks, the LLM often uses external data sources such as user documents, web retrieval, results from API calls, etc. This opens up new avenues for attackers to manipulate the LLM via prompt injection. Adversarial prompts can be carefully crafted and injected into external data sources to override the user's intended instruction and instead execute a malicious instruction. Prompt injection attacks constitute a major threat to LLM security, making the design and implementation of practical countermeasures of paramount importance. To this end, we show that alignment can be a powerful tool to make LLMs more robust against prompt injection. Our method -- SecAlign -- first builds an alignment dataset by simulating prompt injection attacks and constructing pairs of desirable and undesirable responses. Then, we apply existing alignment techniques to fine-tune the LLM to be robust against these simulated attacks. Our experiments show that SecAlign robustifies the LLM substantially with a negligible hurt on model utility. Moreover, SecAlign's protection generalizes to strong attacks unseen in training. Specifically, the success rate of state-of-the-art GCG-based prompt injections drops from 56% to 2% in Mistral-7B after our alignment process. Our code is released at https://github.com/facebookresearch/SecAlign
Cryptography and Security, Machine Learning
What problem does this paper attempt to address?
This paper addresses a security threat faced by large language models (LLMs) deployed in modern software systems: **prompt injection attacks**. Specifically:

1. **Problem background**:
   - LLMs have become an important part of modern software systems, where their natural-language understanding and text-generation capabilities help users complete complex tasks.
   - To carry out these tasks, LLMs often rely on external data sources (user documents, web search results, API call results, etc.), which gives attackers an opportunity to manipulate the model.
2. **The threat of prompt injection attacks**:
   - Attackers can craft adversarial prompts and inject them into external data sources so that the LLM ignores the user's original instruction and executes a malicious one instead.
   - This attack poses a significant threat to LLM security and is considered one of the primary security risks in LLM applications (according to OWASP's assessment).
3. **Limitations of existing defense methods**:
   - Existing defenses (such as StruQ) resist prompt injection to some extent but perform poorly against strong attacks not seen in training, especially optimization-based attacks such as GCG.
   - Because they generalize poorly to unknown attacks, these defenses can be bypassed by attackers in actual deployment.
4. **The solution proposed in the paper**:
   - The paper proposes a new defense method, **SecAlign**. It recasts prompt-injection defense as a preference-optimization problem and uses alignment training to make the LLM more robust against prompt injection.
   - SecAlign first constructs an alignment dataset that pairs desired and undesired outputs for simulated prompt injections, and then fine-tunes the LLM with existing alignment techniques so that it resists these simulated attacks.
5. **Experimental results**:
   - Experiments show that SecAlign substantially improves the LLM's robustness to prompt injection. In particular, the success rate of state-of-the-art GCG-based attacks on Mistral-7B drops from 56% to 2%, with a negligible effect on model utility.

In summary, the paper's main goal is to make LLMs resist prompt injection attacks more effectively, and thereby improve their security in practical applications, through a defense based on alignment training.
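
The dataset-construction step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Sample` and `PreferencePair` types, the prompt template in `format_prompt`, and the choice to simulate an injection by appending another sample's instruction to the data are all assumptions made for illustration. The only idea taken from the text is that each training example pairs a desirable response (to the user's instruction) with an undesirable one (to the injected instruction).

```python
import random
from dataclasses import dataclass


@dataclass
class Sample:
    instruction: str   # the user's intended instruction
    response: str      # reference response to that instruction
    data: str = ""     # external data the LLM is asked to process


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # desirable: follows the user's instruction
    rejected: str  # undesirable: follows the injected instruction


def format_prompt(instruction: str, data: str) -> str:
    # Hypothetical template; the real delimiters are not given in the text.
    return f"[INSTRUCTION]\n{instruction}\n[DATA]\n{data}\n[RESPONSE]\n"


def build_pair(victim: Sample, attacker: Sample) -> PreferencePair:
    # Assumption: an attack is simulated by appending another sample's
    # instruction to the victim's data field.
    injected_data = f"{victim.data} {attacker.instruction}".strip()
    return PreferencePair(
        prompt=format_prompt(victim.instruction, injected_data),
        chosen=victim.response,
        # Stand-in for what a model obeying the injection would output.
        rejected=attacker.response,
    )


def build_dataset(samples: list[Sample], seed: int = 0) -> list[PreferencePair]:
    if len(samples) < 2:
        raise ValueError("need at least two samples to simulate an injection")
    rng = random.Random(seed)
    pairs = []
    for i, victim in enumerate(samples):
        # Draw an index from the other len(samples) - 1 positions, skipping i.
        j = rng.randrange(len(samples) - 1)
        if j >= i:
            j += 1
        pairs.append(build_pair(victim, samples[j]))
    return pairs


# Toy samples, invented purely for illustration.
samples = [
    Sample("Summarize the review.", "The reviewer liked the battery life.",
           data="Great phone, the battery lasts two days."),
    Sample("Translate 'hello' to French.", "Bonjour."),
    Sample("Name the capital of Japan.", "Tokyo."),
]
pairs = build_dataset(samples)
```

Each resulting record has a `prompt`, a `chosen` response, and a `rejected` response, which is the general shape preference-optimization training expects. The specific alignment technique applied afterwards is not stated here beyond "existing alignment techniques".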