Abstract: Large Language Models (LLMs) excel in natural language understanding by capturing hidden semantics in vector space. This process enriches the value of text embeddings for various downstream tasks, thereby fostering the Embedding-as-a-Service (EaaS) business model. However, the risk of privacy leakage due to direct text transmission to servers remains a critical concern. To address this, we introduce Split-N-Denoise (SnD), a private inference framework that splits the model to execute the token embedding layer on the client side at minimal computational cost. This allows the client to introduce noise prior to transmitting the embeddings to the server, and subsequently receive and denoise the perturbed output embeddings for downstream tasks. Our approach is designed for the inference stage of LLMs and requires no modifications to the model parameters. Extensive experiments demonstrate SnD's effectiveness in optimizing the privacy-utility tradeoff across various LLM architectures and diverse downstream tasks. The results reveal an improvement in performance under the same privacy budget compared to the baselines by over 10% on average, offering clients a privacy-preserving solution for local privacy protection.
What problem does this paper attempt to address?
This paper addresses how to protect users' privacy during large language model (LLM) inference. Specifically, when users obtain LLM-generated text embeddings through a network service ("Embedding-as-a-Service", EaaS), sending raw text directly to the server risks leaking private information. To solve this problem, the authors propose Split-N-Denoise (SnD), a private inference framework. It protects users' privacy by executing the token embedding layer on the client side and adding noise to the embeddings before they are transmitted. SnD also includes a denoising module that lets the client denoise the perturbed output embeddings returned by the server before using them in downstream tasks. The method requires no modification of the model parameters and optimizes the privacy-utility trade-off across various LLM architectures and downstream tasks.
### Main Contributions
- **Proposing the SnD Framework**: Combining split inference and denoising techniques to protect users' privacy under the constraint of local differential privacy (LDP). Empirical studies show that this method improves average performance by more than 10% over existing differential-privacy-based methods under the same privacy budget, and maintains utility even in extremely low privacy budget settings (\(\eta \leq 0.01\)).
- **Designing an Innovative Denoising Method**: Deploying a denoising model on the client side. This model is pre-trained on the server side using public datasets and synthetic noise, and then deployed on the client side, where it enhances the embeddings using the specific noise level provided by the user and the original intermediate results (IRs).
### Method Overview
1. **Local Encoder Module**: The user computes the token embeddings of the input locally.
2. **Privatization Module**: The user privatizes the token representations before transmitting them to the server, so that they meet the LDP requirements.
3. **Cloud Encoder Module**: The server transforms the privatized token representations and returns the resulting embeddings to the user.
4. **Denoising Module**: The user locally denoises the received embeddings using their original input and specific noise level, to optimize the balance between privacy and utility.
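The four steps above can be sketched as a toy data flow. This is an illustration under strong assumptions, not the authors' implementation: the vocabulary, the embedding table `EMBED`, the linear "cloud encoder" `W`, and the Gaussian placeholder noise are all hypothetical. Because the toy encoder is linear, the noise contribution can be subtracted exactly. In SnD the cloud encoder is a nonlinear LLM that the client cannot run, so a learned denoising model takes the place of `denoise` below.

```python
import random

rng = random.Random(0)
D = 3
VOCAB = {"hello": 0, "world": 1}  # hypothetical vocabulary
# Hypothetical client-side token embedding table.
EMBED = [[rng.uniform(-1, 1) for _ in range(D)] for _ in VOCAB]
# Hypothetical stand-in for the cloud encoder weights (a single linear map).
W = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(D)]


def local_encode(tokens):
    """Step 1: look up token embeddings on the client."""
    return [EMBED[VOCAB[t]] for t in tokens]


def privatize(X, scale):
    """Step 2: add noise before transmission (placeholder Gaussian noise)."""
    Z = [[rng.gauss(0.0, scale) for _ in row] for row in X]
    X_tilde = [[a + b for a, b in zip(row, z)] for row, z in zip(X, Z)]
    return X_tilde, Z


def cloud_encode(X):
    """Step 3: toy server model, mean pooling followed by a linear map."""
    pooled = [sum(row[j] for row in X) / len(X) for j in range(D)]
    return [sum(W[i][j] * pooled[j] for j in range(D)) for i in range(D)]


def denoise(e_n, Z):
    """Step 4: exact only for a linear encoder, since f(X + Z) - f(Z) = f(X)."""
    noise_part = cloud_encode(Z)
    return [a - b for a, b in zip(e_n, noise_part)]


X = local_encode(["hello", "world"])
X_tilde, Z = privatize(X, scale=5.0)
e_n = cloud_encode(X_tilde)  # the server only ever sees X_tilde
e_d = denoise(e_n, Z)        # client-side correction using the known noise Z
e_clean = cloud_encode(X)    # reference value, not available in the real protocol
```

The point of the sketch is the information asymmetry: only the client knows `Z`, which is why the correction step is placed on the client side.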
### Noise Mechanism
The authors use \(d_\chi\)-privacy to privatize the token representation layer on the client. Given an input sequence \(x = [x_1,\ldots,x_n]\), the token representation layer converts it into a vector sequence \(X = [x_1,\ldots,x_n] \in \mathbb{R}^{n\times d}\). With the L2 norm as the distance metric, the \(d_\chi\)-privacy mechanism with parameter \(\eta\) privatizes a given word embedding \(x_t \in \mathbb{R}^d\) as \(M(x_t) = x_t + z\), where the Laplace noise \(z\) is drawn from the density \(p(z) = c\exp(-\eta \|z\|)\) and \(c\) is a real-valued constant. To improve the performance of the denoising model, the client clips the L2 norm of the privatized representation so that it does not exceed \(C_{x_t}\):
\[M'(x_t) = M(x_t)\cdot\min\left(1,\frac{C_{x_t}}{\|M(x_t)\|}\right)\]
where \(C_{x_t}=\max_{x_t \in X_t}\|x_t\|\).
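A minimal sketch of this mechanism, using only the Python standard library. It relies on a standard sampling fact rather than on anything stated in the summary: a vector with density proportional to \(\exp(-\eta\|z\|)\) in \(\mathbb{R}^d\) has a uniformly distributed direction and a norm that follows a Gamma distribution with shape \(d\) and scale \(1/\eta\). The function names and the `clip_norm` argument (standing in for \(C_{x_t}\)) are hypothetical.

```python
import math
import random


def sample_noise(d, eta, rng):
    """Draw z in R^d with density proportional to exp(-eta * ||z||_2)."""
    # Direction: a normalized Gaussian vector is uniform on the unit sphere.
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    g_norm = math.sqrt(sum(v * v for v in g))
    # Magnitude: the radial density is proportional to r^(d-1) * exp(-eta * r),
    # which is Gamma(shape=d, scale=1/eta).
    r = rng.gammavariate(d, 1.0 / eta)
    return [r * v / g_norm for v in g]


def privatize(x_t, eta, clip_norm, rng):
    """Return (M'(x_t), z): the clipped noisy embedding and the noise used."""
    z = sample_noise(len(x_t), eta, rng)
    noisy = [a + b for a, b in zip(x_t, z)]  # M(x_t) = x_t + z
    norm = math.sqrt(sum(v * v for v in noisy))
    # Rescale only when the norm exceeds the bound, as in the clipping formula.
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [scale * v for v in noisy], z
```

For example, `privatize([0.1, -0.2, 0.3, 0.4], eta=1.0, clip_norm=2.0, rng=random.Random(0))` returns a vector whose L2 norm is at most 2.0, together with the noise vector that the client keeps for denoising. Smaller values of `eta` give larger expected noise norms (\(d/\eta\) on average).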
### Denoising Model
Server-side denoising is limited because the server does not know the noise level. The authors therefore propose a client-side denoising framework, in which the user applies their specific noise and original input to correct the errors in the perturbed embeddings. Given the privatized token representations \(\tilde{X} = [\tilde{x}_1,\ldots,\tilde{x}_n]\) and the noise matrix \(Z = [z_1,\ldots,z_n] \in \mathbb{R}^{n\times d}\), the denoising model is parameterized by an L-layer Transformer decoder:
\[e_d = D(e_n,\tilde{X},Z)\]
The input of the denoising model is a concatenation of vectors:
\[H_0 = [e_n;\tilde{x}_1,\ldots,\tilde{