Abstract: Trusted execution environments (TEEs) for machine learning accelerators are indispensable for secure and efficient ML inference. Optimizing workloads for these accelerator architectures through state-space exploration improves performance and energy consumption. However, such explorations are expensive and slow because the search space is large. Current research therefore relies on fast analytical models that forgo critical hardware details and cross-layer opportunities unique to the hardware security primitives. While cycle-accurate models can theoretically reach better designs, their high runtime cost restricts them to a smaller state space. We present Obsidian, an optimization framework for finding the optimal mapping from ML kernels to a secure ML accelerator. Obsidian addresses this challenge by exploring the state space using analytical and cycle-accurate models cooperatively. The two main exploration components are: (1) a secure accelerator analytical model, which includes the effect of secure hardware while traversing the large mapping state space and produces the best m model mappings; and (2) a compiler profiling step on a cycle-accurate model, which captures runtime bottlenecks to further improve execution runtime, energy, and resource utilization and to find the optimal model mapping. We compare our results to a baseline secure accelerator comprising the state-of-the-art security schemes from GuardNN [33] and SESAME [11]. The analytical model reduces inference latency by 20.5% for a cloud deployment and 8.4% for an edge deployment, with energy improvements of 24% and 19% respectively. The cycle-accurate model further reduces latency by 9.1% for a cloud deployment and 12.2% for an edge deployment, with energy improvements of 13.8% and 13.1%.

What problem does this paper attempt to address?

The problem this paper addresses is how to optimize workload mapping on secure machine-learning (ML) accelerators through state-space exploration, in order to improve performance and energy efficiency. Specifically, the paper proposes an optimization framework named Obsidian, which aims to find the best mapping from ML kernels to a secure ML accelerator, thereby reducing inference latency and improving resource utilization.

### Problem Background

1. **Importance of Trusted Execution Environments (TEEs)**:
   - TEEs for ML accelerators are indispensable for secure and efficient ML inference.
   - Optimizing workloads for the accelerator architecture through state-space exploration improves performance and energy consumption.
2. **Limitations of Existing Methods**:
   - Current research relies on fast analytical models, which ignore key hardware details and cross-layer opportunities specific to the hardware security primitives.
   - Cycle-accurate models can theoretically reach better designs, but their high runtime cost restricts them to a smaller state space.

### Obsidian's Solution

Obsidian explores the state space by using analytical and cycle-accurate models cooperatively. It has two main components:

1. **Secure Accelerator Analytical Model**:
   - Accounts for the effect of secure hardware while traversing the large mapping state space, and produces the best m model mappings.
2. **Compiler Profiling Step**:
   - Profiles candidates on a cycle-accurate model to capture runtime bottlenecks, further improving execution runtime, energy, and resource utilization, and finds the optimal model mapping.

### Main Contributions

- **Analytical Phase**: Efficiently searches the large state space with the analytical model.
- **Profiling Phase**: Addresses runtime bottlenecks through compiler profiling on the cycle-accurate model.
- **Co-exploration**: Combines the strengths of the analytical model and compiler profiling to find the optimal mapping.
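The cooperative exploration described above can be illustrated with a minimal sketch: a cheap model ranks every candidate mapping, and only the best m candidates are passed to an expensive model that picks the winner. Everything in the code is a hypothetical stand-in, not Obsidian's implementation: the `Mapping` tile sizes, the kernel shape, the cost formulas, the 30% security overhead, and the scratchpad threshold are all invented for illustration.

```python
import heapq
from dataclasses import dataclass
from itertools import product
from math import ceil


@dataclass(frozen=True)
class Mapping:
    """Hypothetical mapping: tile sizes for an M x N x K matrix multiply."""
    tm: int
    tn: int
    tk: int


DIMS = (256, 256, 256)  # toy kernel shape (assumption)


def analytical_cost(mp: Mapping) -> float:
    """Fast closed-form estimate: compute cycles plus memory traffic,
    with traffic inflated by a made-up encryption/integrity factor."""
    M, N, K = DIMS
    tiles = ceil(M / mp.tm) * ceil(N / mp.tn) * ceil(K / mp.tk)
    compute = tiles * mp.tm * mp.tn * mp.tk / 64.0          # assume 64 MACs per cycle
    traffic = tiles * (mp.tm * mp.tk + mp.tk * mp.tn + mp.tm * mp.tn)
    return compute + traffic * 1.3 / 16.0                   # assume 30% overhead, 16 B/cycle


def profiled_cost(mp: Mapping) -> float:
    """Stand-in for a cycle-accurate run: adds a stall penalty that the
    analytical model does not see (working set exceeding a scratchpad)."""
    working_set = mp.tm * mp.tk + mp.tk * mp.tn + mp.tm * mp.tn
    base = analytical_cost(mp)
    stall = 0.25 * base if working_set > 8192 else 0.0      # assumed threshold
    return base + stall


def cooperative_search(candidates, m, fast_cost, slow_cost):
    """Rank all candidates with the fast model, keep the best m,
    then choose among those m with the slow model."""
    shortlist = heapq.nsmallest(m, candidates, key=fast_cost)
    best = min(shortlist, key=slow_cost)
    return best, shortlist


candidates = [Mapping(a, b, c) for a, b, c in product((16, 32, 64, 128), repeat=3)]
best, shortlist = cooperative_search(candidates, 5, analytical_cost, profiled_cost)
print(best)
```

In this sketch the slow model is evaluated on only m of the 64 candidates, which is the point of the two-stage design: the expensive evaluation is reserved for mappings the cheap model already considers promising.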
### Experimental Results

Compared with a baseline secure accelerator built from the security schemes of GuardNN and SESAME, Obsidian achieves the following:

- The analytical model reduces inference latency by 20.5% for a cloud deployment and 8.4% for an edge deployment, with energy improvements of 24% and 19% respectively.
- The cycle-accurate model further reduces latency by 9.1% for a cloud deployment and 12.2% for an edge deployment, with energy improvements of 13.8% and 13.1% respectively.

### Summary

Through cooperative state-space exploration, Obsidian reduces both inference latency and energy consumption on secure ML accelerators relative to the baseline, in cloud and edge deployments alike.
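The text reports the two stages' latency reductions separately and does not state what the "further" reduction is measured against. As a worked example only, the sketch below combines them under the assumption that the second-stage percentage applies to the latency remaining after the first stage. If the paper measures both against the baseline, this calculation does not apply. The function name `combined_reduction` is hypothetical.

```python
def combined_reduction(first: float, further: float) -> float:
    """Overall fractional reduction when `further` is applied to what
    remains after `first` (an assumption about how stages compose)."""
    return 1.0 - (1.0 - first) * (1.0 - further)


# Latency figures quoted in the text: analytical stage, then cycle-accurate stage.
cloud_latency = combined_reduction(0.205, 0.091)
edge_latency = combined_reduction(0.084, 0.122)
print(f"cloud: {cloud_latency:.1%}, edge: {edge_latency:.1%}")
```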

Obsidian: Cooperative State-Space Exploration for Performant Inference on Secure ML Accelerators

SESAME: Software defined Enclaves to Secure Inference Accelerators with Multi-tenant Execution

An Open-Source ML-Based Full-Stack Optimization Framework for Machine Learning Accelerators

Slalom: Fast, Verifiable and Private Execution of Neural Networks in Trusted Hardware

OnSRAM: Efficient Inter-Node On-Chip Scratchpad Management in Deep Learning Accelerators

TEESlice: Slicing DNN Models for Secure and Efficient Deployment

Customizing Trusted AI Accelerators for Efficient Privacy-Preserving Machine Learning

PALM: A Efficient Performance Simulator for Tiled Accelerators with Large-scale Model Training

Apollo: Transferable Architecture Exploration

Data-Oblivious ML Accelerators using Hardware Security Extensions

Bandwidth Utilization Side-Channel on ML Inference Accelerators

Empowering Data Centers for Next Generation Trusted Computing

Mitigating Edge Machine Learning Inference Bottlenecks: An Empirical Study on Accelerating Google Edge Models

Defense against ML-based Power Side-channel Attacks on DNN Accelerators with Adversarial Attacks

Synergy: Towards On-Body AI via Tiny AI Accelerator Collaboration on Wearables

ACAI: Protecting Accelerator Execution with Arm Confidential Computing Architecture

Survey and design of paleozoic: a high-performance compiler tool chain for deep learning inference accelerator

ShadowNet: A Secure and Efficient On-device Model Inference System for Convolutional Neural Networks

Multi-Component Optimization and Efficient Deployment of Neural-Networks on Resource-Constrained IoT Hardware

Privacy preserving layer partitioning for Deep Neural Network models