Fastrack: Fast IO for Secure ML using GPU TEEs

Yongqin Wang, Rachit Rajat, Jonghyun Lee, Tingting Tang, Murali Annavaram
2024-10-20
Abstract: As cloud-based ML expands, ensuring data security during training and inference is critical. GPU-based Trusted Execution Environments (TEEs) offer secure, high-performance solutions, with CPU TEEs managing data movement and GPU TEEs handling authentication and computation. However, CPU-to-GPU communication overheads significantly hinder performance, as data must be encrypted, authenticated, decrypted, and verified, increasing costs by 12.69 to 33.53 times. This results in GPU TEE inference becoming 54.12% to 903.9% slower and training 10% to 455% slower than non-TEE systems, undermining GPU TEE advantages in latency-sensitive applications. This paper analyzes Nvidia H100 TEE protocols and identifies three key overheads: 1) redundant CPU re-encryption, 2) limited authentication parallelism, and 3) unnecessary operation serialization. We propose Fastrack, optimizing with 1) direct GPU TEE communication, 2) parallelized authentication, and 3) overlapping decryption with PCI-e transmission. These optimizations cut communication costs and reduce inference/training runtime by up to 84.6%, with minimal overhead compared to non-TEE systems.
Cryptography and Security, Hardware Architecture
What problem does this paper attempt to address?
The paper attempts to address the significant communication overhead issue in machine learning (ML) training and inference within GPU-based Trusted Execution Environments (GPU TEE). Specifically:

1. **Data Transfer Overhead**: In existing GPU TEE systems, data transfer between CPU TEE and GPU TEE involves steps such as encryption, generating message authentication codes (MAC), decryption, and verification. These additional computational overheads increase the communication cost from CPU to GPU by 12.69 times to 33.53 times. This slows down GPU TEE inference by 54.12% to 903.9% and training by 10% to 455%.
2. **Redundant Encryption and Authentication**: In traditional secure ML implementations, user input data is already encrypted and authenticated. However, before sending it to GPU TEE, CPU TEE decrypts, re-encrypts, and generates new MAC tags, adding unnecessary computational overhead.
3. **Lack of Parallelism in Authentication**: The current AES-GCM authentication algorithm is sequential, unable to fully utilize the highly parallel computing resources of the GPU, leading to inefficient authentication processes.
4. **Operation Serialization**: In existing implementations, GPU TEE starts decryption and authentication only after receiving all the transmitted data. These operations can be partially parallelized to improve efficiency.

To address these issues, the paper proposes Fastrack, which reduces CPU to GPU communication overhead and significantly improves ML inference and training performance through the following optimizations:

1. **Direct Communication Channel**: Allows remote users to establish a direct secure communication channel with GPU TEE, avoiding redundant encryption and authentication steps by CPU TEE.
2. **Increased Authentication Parallelism**: Implements multi-chaining authentication, dividing large data blocks into smaller chunks for parallel authentication, thereby increasing throughput.
3. **Overlapping Decryption and Authentication**: Performs decryption and authentication simultaneously during PCI-e transfer, further reducing communication latency.

With these optimizations, Fastrack can significantly reduce CPU to GPU communication costs, reducing end-to-end ML inference and training runtime by up to 84.6%, and in some cases, its performance is comparable to non-TEE GPU systems.
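
The idea behind chunked, parallel authentication can be illustrated with a minimal sketch. This is not the paper's construction: it swaps in HMAC-SHA256 from the Python standard library in place of AES-GCM's authentication, uses a thread pool as a stand-in for GPU threads, and the names `chunk_tag`, `multi_chain_tag`, `verify`, and `CHUNK_SIZE` are hypothetical. The way per-chunk tags are folded into one final tag is likewise an assumption made for illustration.

```python
import hashlib
import hmac
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 4096  # bytes per chunk; an arbitrary value chosen for illustration


def chunk_tag(key: bytes, index: int, chunk: bytes) -> bytes:
    """Tag one chunk. The fixed-width index binds the chunk to its position."""
    return hmac.new(key, index.to_bytes(8, "big") + chunk, hashlib.sha256).digest()


def multi_chain_tag(key: bytes, data: bytes,
                    chunk_size: int = CHUNK_SIZE, workers: int = 4) -> bytes:
    """Authenticate `data` as independent chunks, then combine the chunk tags."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    # Each chunk tag depends only on its own chunk, so the tags can be
    # computed concurrently instead of along one long sequential chain.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        tags = list(pool.map(lambda item: chunk_tag(key, *item), enumerate(chunks)))
    # The final tag covers the chunk count and every chunk tag, so dropping,
    # reordering, or modifying any chunk changes the result.
    combined = len(chunks).to_bytes(8, "big") + b"".join(tags)
    return hmac.new(key, combined, hashlib.sha256).digest()


def verify(key: bytes, data: bytes, tag: bytes,
           chunk_size: int = CHUNK_SIZE) -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(multi_chain_tag(key, data, chunk_size), tag)


if __name__ == "__main__":
    key = b"\x01" * 32
    payload = bytes(range(256)) * 64
    tag = multi_chain_tag(key, payload)
    print(verify(key, payload, tag))             # intact payload
    print(verify(key, payload[:-1] + b"Z", tag))  # last byte altered
```

In this sketch the only sequential step is the final combination over the short list of chunk tags; the per-chunk work, which dominates for large transfers, is independent. Python threads are used only to show the structure and should not be read as a performance claim.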