Abstract: The massive parallelism that Graphics Processing Units (GPUs) provide for accelerating compute-intensive tasks makes them attractive for real-time systems such as autonomous vehicles. Such systems run heavy Machine Learning (ML) and Computer Vision applications, which rely on the computing power of GPUs. However, they also need a guarantee of timing predictability: the Worst-Case Execution Time (WCET) of each application must be estimated tightly and safely so that the application can be scheduled to finish before its deadline, avoiding catastrophic consequences. As more applications use GPUs, running many applications simultaneously on the same GPU becomes necessary. To provide predictable performance while applications run in parallel, execution must be WCET-aware, which GPUs do not fully support in a multitasking environment. Nvidia recently added a feature called the Multi-Process Service (MPS). It allows different applications to run simultaneously in the same CUDA context by partitioning the compute resources of the GPU. Using this feature, we can measure the interference from co-running GPU applications to estimate WCET. In this paper, we propose a novel technique to estimate the WCET of a GPU kernel using an ML approach. Our approach is based on the application's source code, and the model is trained on a large data set. The approach is flexible and can be applied to different GPU-sharing mechanisms. We run the victim kernel and an enemy kernel in parallel on the GPU so that the victim experiences the maximum interference from the enemy, and we use this to estimate the WCET of the victim kernel. Enemy kernels are chosen to cause a higher slowdown by acquiring the resources of the victim kernel. We compare our implementation with state-of-the-art approaches to show its effectiveness. Our ML approach reduces the estimation time by 99% in most cases, because inference takes only seconds to predict a WCET. It also needs minimal resources compared with traditional approaches, because the application does not have to run on the GPU for hours. Although our approach does not offer safety guarantees because of its empirical nature, we observed that the predicted WCETs are always higher than any observed execution time for all benchmarks, and the maximum overestimation factor observed is 11x.
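The two closing metrics can be made concrete with a small sketch. It assumes the overestimation factor is the predicted WCET divided by the largest observed execution time, and that a prediction counts as empirically safe when it exceeds every observation. The function names and the timing values are hypothetical illustrations, not taken from the paper.

```python
from typing import Sequence


def overestimation_factor(predicted_wcet: float, observed_times: Sequence[float]) -> float:
    """Ratio of the predicted WCET to the largest observed execution time.

    Assumed definition: factor = predicted / max(observed).
    """
    if not observed_times:
        raise ValueError("need at least one observed execution time")
    worst_observed = max(observed_times)
    if worst_observed <= 0:
        raise ValueError("execution times must be positive")
    return predicted_wcet / worst_observed


def is_empirically_safe(predicted_wcet: float, observed_times: Sequence[float]) -> bool:
    """True if the prediction is higher than every observed execution time.

    This is an empirical check only; it is not a formal safety guarantee.
    """
    return predicted_wcet > max(observed_times)


# Hypothetical measurements in milliseconds (not from the paper).
observed_ms = [4.0, 5.0, 4.5]
predicted_ms = 10.0

factor = overestimation_factor(predicted_ms, observed_ms)
safe = is_empirically_safe(predicted_ms, observed_ms)
print(f"overestimation factor: {factor:.1f}x, empirically safe: {safe}")
```

A factor close to 1 would indicate a tight estimate. A factor below 1 would mean the prediction underestimates at least one observed run.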
