Abstract: Transformer-based models are widely used in natural language processing tasks, and their application has been extended to computer vision as well. Data security has become a crucial concern when deploying such deep learning services on cloud platforms. To address this concern, multi-party computation (MPC) is employed to prevent data and model leakage during inference. However, the Transformer model introduces several challenges for MPC, including the time overhead of the Softmax (normalized exponential) function, the accuracy issue caused by the "dynamic range" of approximated division and exponential, and the high memory overhead when processing long sequences. To overcome these challenges, we propose MLformer, an MPC-based inference framework for transformer models in the semi-honest adversary model, built on CrypTen (Knott et al., Adv Neural Inf Process Syst 34: 4961–4973, 2021), a secure machine learning framework proposed by the Facebook AI Research group. In this framework, we replace the softmax attention with linear attention, whose time and memory complexity are linear in the input length. This modification eliminates the softmax function entirely, resulting in lower time and memory overhead. To ensure the accuracy of linear attention, we propose scaled linear attention, which addresses the dynamic range issue caused by the MPC division, together with a new approximate division function that reduces the computational time of the attention block. Furthermore, to improve the efficiency and accuracy of the MPC exponential and reciprocal, which are commonly used in transformer models, we propose a novel MPC exponential protocol and integrate, for the first time, the efficient reciprocal protocol of Bar-Ilan and Beaver (in Proceedings of the 8th annual ACM symposium on principles of distributed computing, pp. 201–209, 1989) into our framework.
Additionally, we optimize the computation of causal linear attention, which is used in private inference for auto-regressive tasks, with novel CUDA kernel functions. All the preceding optimizations contribute to a more accurate and efficient framework. The experimental results demonstrate that our framework achieves comparable accuracy with reduced inference time and GPU memory overhead compared to the original transformer model. The speedup reaches 78.79% compared to a traditional private transformer with an input length of 1024 patches.
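
The linear-attention idea mentioned in the abstract can be illustrated with a small plaintext sketch. This is not the MLformer protocol: it runs in the clear with no secret sharing, it omits the paper's scaling and approximate division, and the feature map `phi` (elu(x) + 1) is an assumed, commonly used choice rather than something stated in the abstract. The function names are hypothetical. It only shows why the cost becomes linear in the sequence length: the keys and values are summarized once into a small `d x d_v` matrix, so no `n x n` attention matrix is ever formed.

```python
import math

def phi(x):
    """elu(x) + 1: a positive feature map (assumed choice, not from the source)."""
    return x + 1.0 if x > 0 else math.exp(x)

def linear_attention(Q, K, V):
    """Non-causal linear attention in O(n * d * d_v) time, with no n x n matrix."""
    d, d_v = len(K[0]), len(V[0])
    kv = [[0.0] * d_v for _ in range(d)]   # running sum of phi(k_j) v_j^T
    z = [0.0] * d                          # running sum of phi(k_j)
    for k_row, v_row in zip(K, V):
        fk = [phi(x) for x in k_row]
        for a in range(d):
            z[a] += fk[a]
            for b in range(d_v):
                kv[a][b] += fk[a] * v_row[b]
    out = []
    for q_row in Q:
        fq = [phi(x) for x in q_row]
        denom = sum(fq[a] * z[a] for a in range(d))   # normalizer (one division per row)
        out.append([sum(fq[a] * kv[a][b] for a in range(d)) / denom
                    for b in range(d_v)])
    return out

def quadratic_reference(Q, K, V):
    """Same kernel attention computed the O(n^2) way, for comparison only."""
    out = []
    for q_row in Q:
        fq = [phi(x) for x in q_row]
        w = [sum(a * phi(b) for a, b in zip(fq, k_row)) for k_row in K]
        total = sum(w)
        out.append([sum(w[j] * V[j][b] for j in range(len(V))) / total
                    for b in range(len(V[0]))])
    return out

# Toy inputs: 3 tokens, key/query dimension 2, value dimension 2.
Q = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6]]
K = [[0.2, 0.1], [-0.3, 0.5], [0.7, -0.4]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
fast = linear_attention(Q, K, V)
slow = quadratic_reference(Q, K, V)
max_diff = max(abs(f - s) for fr, sr in zip(fast, slow) for f, s in zip(fr, sr))
```

The two functions agree up to floating-point rounding, since they only reorder the same sums. In an MPC setting the remaining expensive step is the per-row division by `denom`, which is where the abstract's approximate division and reciprocal protocols would come in.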

Primer: Fast Private Transformer Inference on Encrypted Data

LLMs Can Understand Encrypted Prompt: Towards Privacy-Computing Friendly Transformers

CipherFormer: Efficient Transformer Private Inference with Low Round Complexity

PPTIF: Privacy-Preserving Transformer Inference Framework for Language Translation

A Survey on Private Transformer Inference

Comet: A Communication-efficient and Performant Approximation for Private Transformer Inference

I can't see it but I can Fine-tune it: On Encrypted Fine-tuning of Transformers using Fully Homomorphic Encryption

East: Efficient and Accurate Secure Transformer Framework for Inference

Primer: Searching for Efficient Transformers for Language Modeling

Secure Transformer Inference Protocol

A Fast Post-Training Pruning Framework for Transformers

SecureGPT: A Framework for Multi-Party Privacy-Preserving Transformer Inference in GPT

MLFormer: a high performance MPC linear inference framework for transformers

SecPE: Secure Prompt Ensembling for Private and Robust Large Language Models

TextFusion: Privacy-Preserving Pre-trained Model Inference Via Token Fusion.

The Inhibitor: ReLU and Addition-Based Attention for Efficient Transformers

PUMA: Secure Inference of LLaMA-7B in Five Minutes

MERGE: Fast Private Text Generation

Privacy-Preserving Vision Transformer on Permutation-Encrypted Images

Nimbus: Secure and Efficient Two-Party Inference for Transformers

Faster CryptoNets: Leveraging Sparsity for Real-World Encrypted Inference