Abstract: As cloud-based ML expands, ensuring data security during training and inference is critical. GPU-based Trusted Execution Environments (TEEs) offer secure, high-performance solutions, with CPU TEEs managing data movement and GPU TEEs handling authentication and computation. However, CPU-to-GPU communication overheads significantly hinder performance, as data must be encrypted, authenticated, decrypted, and verified, increasing costs by 12.69 to 33.53 times. This results in GPU TEE inference becoming 54.12% to 903.9% slower and training 10% to 455% slower than non-TEE systems, undermining GPU TEE advantages in latency-sensitive applications. This paper analyzes Nvidia H100 TEE protocols and identifies three key overheads: 1) redundant CPU re-encryption, 2) limited authentication parallelism, and 3) unnecessary operation serialization. We propose Fastrack, optimizing with 1) direct GPU TEE communication, 2) parallelized authentication, and 3) overlapping decryption with PCI-e transmission. These optimizations cut communication costs and reduce inference/training runtime by up to 84.6%, with minimal overhead compared to non-TEE systems.
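The percentages in the abstract are easier to compare once converted to runtime multipliers. The sketch below does that arithmetic. It assumes that "X% slower" means the runtime grows by X% of the non-TEE baseline and that "reduced by Y%" means Y% of the runtime is removed. The function names are illustrative and do not come from the paper.

```python
def slowdown_factor(percent_slower: float) -> float:
    """Runtime multiplier versus the baseline.

    Assumption: 'X% slower' means runtime = baseline * (1 + X / 100).
    """
    return 1.0 + percent_slower / 100.0


def speedup_from_reduction(percent_reduction: float) -> float:
    """Speedup implied by removing the given percentage of the runtime."""
    if not 0.0 <= percent_reduction < 100.0:
        raise ValueError("percent_reduction must be in [0, 100)")
    return 1.0 / (1.0 - percent_reduction / 100.0)


if __name__ == "__main__":
    # Figures quoted in the abstract.
    for label, pct in [("inference, best case", 54.12),
                       ("inference, worst case", 903.9),
                       ("training, best case", 10.0),
                       ("training, worst case", 455.0)]:
        print(f"{label}: {slowdown_factor(pct):.2f}x the non-TEE runtime")
    print(f"84.6% runtime reduction: {speedup_from_reduction(84.6):.2f}x speedup")
```

Under these assumptions, the worst inference case runs at roughly ten times the non-TEE runtime, and an 84.6% reduction corresponds to a speedup of about 6.5x.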

What problem does this paper attempt to address?

The paper addresses the significant communication overhead of machine learning (ML) training and inference within GPU-based Trusted Execution Environments (GPU TEEs). Specifically:

1. **Data Transfer Overhead**: In existing GPU TEE systems, data transfer between the CPU TEE and the GPU TEE involves encryption, generation of message authentication codes (MACs), decryption, and verification. This extra computation increases CPU-to-GPU communication cost by 12.69 to 33.53 times, which slows GPU TEE inference by 54.12% to 903.9% and training by 10% to 455%.
2. **Redundant Encryption and Authentication**: In traditional secure ML implementations, user input data arrives already encrypted and authenticated. Before sending it to the GPU TEE, however, the CPU TEE decrypts it, re-encrypts it, and generates new MAC tags, adding unnecessary computational overhead.
3. **Lack of Parallelism in Authentication**: The AES-GCM authentication algorithm currently in use is sequential, so it cannot fully utilize the GPU's highly parallel computing resources, and authentication is inefficient as a result.
4. **Operation Serialization**: In existing implementations, the GPU TEE starts decryption and authentication only after it has received all of the transmitted data. These operations can be partially overlapped with the transfer to improve efficiency.

To address these issues, the paper proposes Fastrack, which reduces CPU-to-GPU communication overhead and significantly improves ML inference and training performance through the following optimizations:

1. **Direct Communication Channel**: Allows remote users to establish a direct secure communication channel with the GPU TEE, avoiding redundant encryption and authentication steps in the CPU TEE.
2. **Increased Authentication Parallelism**: Implements multi-chaining authentication, dividing large data blocks into smaller chunks that are authenticated in parallel, thereby increasing throughput.
3. **Overlapping Decryption and Authentication**: Performs decryption and authentication while the PCI-e transfer is still in progress, further reducing communication latency.

With these optimizations, Fastrack reduces CPU-to-GPU communication costs and cuts end-to-end ML inference and training runtime by up to 84.6%; in some cases, its performance is comparable to that of non-TEE GPU systems.
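The idea of authenticating independent chunks in parallel can be sketched in a few lines of Python. This is not the paper's construction: HMAC-SHA256 from the standard library stands in for the AES-GCM tag computation, the 4096-byte chunk size is arbitrary, and `tag_chunks` and `verify_chunks` are hypothetical names. The sketch only shows that once a buffer is split into chunks, each chunk's tag depends on that chunk alone and can be computed by a separate worker. Each tag also covers the chunk index and the total chunk count, so reordered or dropped chunks fail verification.

```python
import hashlib
import hmac
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 4096  # arbitrary illustrative value, not taken from the paper


def _chunk_tag(key: bytes, index: int, total: int, chunk: bytes) -> bytes:
    # Bind the tag to the chunk's position and to the total number of chunks.
    header = index.to_bytes(8, "big") + total.to_bytes(8, "big")
    return hmac.new(key, header + chunk, hashlib.sha256).digest()


def _split(data: bytes, chunk_size: int) -> list:
    # An empty buffer still yields one (empty) chunk so it gets a tag.
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]


def tag_chunks(key: bytes, data: bytes, chunk_size: int = CHUNK_SIZE) -> list:
    """Compute one tag per chunk; the chunks are independent of each other."""
    chunks = _split(data, chunk_size)
    total = len(chunks)
    # Threads are used here only to show that the work is independent.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda p: _chunk_tag(key, p[0], total, p[1]),
                             enumerate(chunks)))


def verify_chunks(key: bytes, data: bytes, tags: list,
                  chunk_size: int = CHUNK_SIZE) -> bool:
    """Recompute every tag and compare each one in constant time."""
    expected = tag_chunks(key, data, chunk_size)
    if len(expected) != len(tags):
        return False
    results = [hmac.compare_digest(e, t) for e, t in zip(expected, tags)]
    return all(results)


if __name__ == "__main__":
    key = b"k" * 32
    payload = bytes(range(256)) * 40  # 10240 bytes, i.e. three chunks
    tags = tag_chunks(key, payload)
    print(len(tags), verify_chunks(key, payload, tags))
```

On a GPU the per-chunk work would map to separate thread blocks rather than Python threads; the thread pool here is only a stand-in for that parallelism.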

Fastrack: Fast IO for Secure ML using GPU TEEs
