Embedding-based classifiers can detect prompt injection attacks
Md. Ahsan Ayub,Subhabrata Majumdar
2024-10-30
Abstract:Large Language Models (LLMs) are seeing significant adoption in every type of organization due to their exceptional generative capabilities. However, LLMs are found to be vulnerable to various adversarial attacks, particularly prompt injection attacks, which trick them into producing harmful or inappropriate content. Adversaries execute such attacks by crafting malicious prompts to deceive the LLMs. In this paper, we propose a novel approach based on embedding-based Machine Learning (ML) classifiers to protect LLM-based applications against this severe threat. We leverage three commonly used embedding models to generate embeddings of malicious and benign prompts and utilize ML classifiers to predict whether an input prompt is malicious. Out of several traditional ML methods, we achieve the best performance with classifiers built using Random Forest and XGBoost. Our classifiers outperform state-of-the-art prompt injection classifiers available in open-source implementations, which use encoder-only neural networks.
Cryptography and Security,Machine Learning
What problem does this paper attempt to address?
The problem that this paper attempts to solve is that large language models (LLMs) are vulnerable to the threat of prompt injection attacks. Specifically, such attacks induce LLMs to generate harmful or inappropriate content by constructing malicious prompts. To address this issue, the author proposes a new method based on an embedded machine - learning classifier to protect LLM - based applications from this serious threat.
### Detailed Explanation
1. **Background and Problem Description**:
- Large language models (LLMs) are widely used because of their excellent generation capabilities, but they are also vulnerable to various adversarial attacks, especially prompt injection attacks.
- Prompt injection attacks deceive LLMs through carefully designed malicious prompts, causing them to produce harmful content.
- These attacks are not limited to directly inputting malicious prompts, but also include modifying benign user - provided prompts through man - in - the - middle attacks, or injecting malicious prompts from external sources (such as websites or files).
2. **Research Objectives**:
- **RQ1**: Are there differences between malicious prompts and benign prompts in the embedding space?
- **RQ2**: Can malicious prompts be effectively identified to prevent prompt injection attacks?
3. **Solutions**:
- The author uses three commonly used embedding models (OpenAI, GTE, MiniLM) to generate the embedding representations of malicious and benign prompts.
- Using these embeddings as input datasets, a variety of supervised machine - learning classifiers (such as Random Forest, XGBoost, etc.) are constructed to detect prompt injection attacks.
- The effectiveness of the proposed method is verified by comparing it with the existing state - of - the - art deep - learning - based prompt injection classifiers.
4. **Main Contributions**:
- The distribution differences between benign and malicious embeddings generated using three embedding models are studied.
- A series of embedded - based supervised machine - learning classifiers are constructed and perform well on multiple evaluation metrics.
- The experimental results show that the classifiers based on Random Forest and OpenAI embeddings outperform the existing open - source implementations in terms of AUC, precision, and recall.
### Formula Representation
- **Precision (Precision Rate)**:
\[
\text{Precision}=\frac{\text{True Positive}}{\text{True Positive}+\text{False Positive}}
\]
- **Recall (Recall Rate)**:
\[
\text{Recall}=\frac{\text{True Positive}}{\text{True Positive}+\text{False Negative}}
\]
- **F1 Score**:
\[
\text{F1}=\frac{2\times\text{Precision}\times\text{Recall}}{\text{Precision}+\text{Recall}}
\]
Through these formulas, the performance evaluation metrics of the classifier can be more clearly understood.