Abstract: Transformer-based models are widely used in natural language processing tasks, and their application has been extended to computer vision as well. Data security has become a crucial concern when deploying deep learning services on cloud platforms. To address this concern, multi-party computation (MPC) is employed to prevent data and model leakage during inference. However, the Transformer model introduces several challenges for MPC, including the time overhead of the Softmax (normalized exponential) function, the accuracy issue caused by the "dynamic range" of approximated division and exponential, and the high memory overhead when processing long sequences. To overcome these challenges, we propose MLformer, an MPC-based inference framework for transformer models in the semi-honest adversary model, built on CrypTen (Knott et al., Adv Neural Inf Process Syst 34: 4961–4973, 2021), a secure machine learning framework from the Facebook AI Research group. In this framework, we replace softmax attention with linear attention, whose time and memory complexity are linear in the input length. This modification eliminates the softmax function entirely, resulting in lower time and memory overhead. To preserve the accuracy of linear attention, we propose scaled linear attention, which addresses the dynamic range issue caused by the MPC division, and a new approximate division function that reduces the computation time of the attention block. Furthermore, to improve the efficiency and accuracy of the MPC exponential and reciprocal, which are commonly used in transformer models, we propose a novel MPC exponential protocol and are the first to integrate the efficient reciprocal protocol of Bar-Ilan and Beaver (in Proceedings of the 8th annual ACM symposium on principles of distributed computing, pp. 201–209, 1989) into our framework.
Additionally, we optimize the computation of causal linear attention, which is used in private inference for autoregressive tasks, with our novel CUDA kernel functions. All the preceding optimizations contribute to a more accurate and efficient framework. The experimental results demonstrate that our framework achieves comparable accuracy with reduced inference time and GPU memory overhead compared to the original transformer model. The speedup reaches 78.79% compared to a traditional private transformer with an input length of 1024 patches.
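
To illustrate why linear attention avoids the softmax and scales linearly with sequence length, here is a minimal plaintext NumPy sketch. It is not the framework's MPC implementation and does not use CrypTen. The feature map `elu(x) + 1` is an assumption borrowed from common linear-attention formulations, since the abstract does not name one, and all function names are hypothetical. The scaled variant, the approximate division, and the CUDA kernels described in the abstract are not reproduced here.

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1, a strictly positive feature map.
    # Assumption: the abstract does not state which feature map is used.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(q, k, v):
    """Non-causal linear attention for q, k of shape (n, d) and v of shape (n, d_v)."""
    phi_q, phi_k = feature_map(q), feature_map(k)
    kv = phi_k.T @ v               # (d, d_v) summary, built in one pass over the keys
    z = phi_k.sum(axis=0)          # (d,) normalizer summary
    # One division per query row replaces the softmax normalization.
    return (phi_q @ kv) / (phi_q @ z)[:, None]

def causal_linear_attention(q, k, v):
    """Causal variant: position i only sees keys and values at positions <= i."""
    phi_q, phi_k = feature_map(q), feature_map(k)
    n, d_v = v.shape
    kv = np.zeros((q.shape[1], d_v))   # running sum of outer(phi_k[j], v[j])
    z = np.zeros(q.shape[1])           # running sum of phi_k[j]
    out = np.empty((n, d_v))
    for i in range(n):
        kv += np.outer(phi_k[i], v[i])
        z += phi_k[i]
        out[i] = (phi_q[i] @ kv) / (phi_q[i] @ z)
    return out

def quadratic_reference(q, k, v, causal=False):
    """Same attention computed through the explicit (n, n) score matrix."""
    scores = feature_map(q) @ feature_map(k).T
    if causal:
        scores = np.tril(scores)       # zero out scores for future positions
    return (scores @ v) / scores.sum(axis=1, keepdims=True)
```

The linear versions never materialize the n-by-n score matrix that `quadratic_reference` builds, which is where the time and memory savings for long inputs come from. The only nonlinear operations left are the feature map and one division per row, and in an MPC setting those are the approximated operations the abstract's accuracy concerns refer to.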

VMMP: Verifiable Privacy-Preserving Multi-Modal Multi-Task Prediction

Multi-Dimension Compression of Feed-Forward Network in Vision Transformers

Efficient Computation Sharing for Multi-Task Visual Scene Understanding

Secure and Effective Data Appraisal for Machine Learning

PPTIF: Privacy-Preserving Transformer Inference Framework for Language Translation

A Survey on Private Transformer Inference

MLFormer: a high performance MPC linear inference framework for transformers

ShiftAddViT: Mixture of Multiplication Primitives Towards Efficient Vision Transformer

MPCViT: Searching for Accurate and Efficient MPC-Friendly Vision Transformer with Heterogeneous Attention

Toward Verifiable and Privacy Preserving Machine Learning Prediction

ViT-MVT: A Unified Vision Transformer Network for Multiple Vision Tasks

MPC-Pipe: an Efficient Pipeline Scheme for Secure Multi-party Machine Learning Inference

Optimizing Privacy-Preserving Outsourced Convolutional Neural Network Predictions

MIMO Is All You Need: A Strong Multi-In-Multi-Out Baseline for Video Prediction

Tri-Modal Transformers with Mixture-of-Modality-Experts for Social Media Prediction

SecureGPT: A Framework for Multi-Party Privacy-Preserving Transformer Inference in GPT

Improving Multiple Dense Prediction Performances by Exploiting Inter-Task Synergies for Neuromorphic Vision Sensors

Towards Multi-modal Transformers in Federated Learning

Multimodal Motion Prediction with Stacked Transformers

MultiModN- Multimodal, Multi-Task, Interpretable Modular Networks

FM-ViT: Flexible Modal Vision Transformers for Face Anti-Spoofing