Abstract: The massive parallelism of Graphics Processing Units (GPUs) accelerates compute-intensive tasks, which makes GPUs attractive for real-time systems such as autonomous vehicles. Such systems run heavy Machine Learning (ML) and Computer Vision applications that depend on the computing power of GPUs. However, they also need a guarantee of timing predictability: the Worst-Case Execution Time (WCET) of each application must be estimated tightly and safely so that the application can be scheduled to finish before its deadline, avoiding catastrophic consequences. As more applications use GPUs, running many applications simultaneously on the same GPU becomes necessary. To provide predictable performance while applications run in parallel, co-execution must be WCET-aware, which GPUs do not fully support in a multitasking environment. Nvidia recently added a feature called the Multi-Process Service (MPS). It allows different applications to run simultaneously in the same CUDA context by partitioning the compute resources of the GPU. Using this feature, we can measure the interference from co-running GPU applications to estimate the WCET. In this paper, we propose a novel technique that estimates the WCET of a GPU kernel using an ML approach. Our approach is based on the application's source, and the model is trained on a large data set. The approach is flexible and can be applied to different GPU-sharing mechanisms. We execute a victim kernel and an enemy kernel in parallel on the GPU so that the enemy causes maximum interference, and we use this setting to estimate the WCET of the victim kernel. Enemy kernels are chosen to cause a higher slowdown by acquiring the resources of the victim kernel. We compare our implementation with state-of-the-art approaches to show its effectiveness.
Our ML approach reduces estimation time by 99% in most cases, because inference takes only seconds to predict the WCET. It also needs far fewer resources than traditional approaches, because the application does not have to be executed on the GPU for hours. Although our approach offers no safety guarantees because of its empirical nature, the predicted WCETs were higher than every observed execution time for all benchmarks, and the maximum overestimation factor observed was 11x.
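
The two evaluation criteria mentioned above (a predicted WCET that exceeds every observed execution time, and the overestimation factor) can be illustrated with a minimal Python sketch. It assumes that the overestimation factor is the ratio of the predicted WCET to the largest observed execution time; the abstract does not spell out the definition, so this reading, the function names, and the sample timings are all hypothetical.

```python
def overestimation_factor(predicted_wcet, observed_times):
    """Ratio of the predicted WCET to the largest observed execution time.

    Assumed definition: the abstract does not state the exact formula.
    """
    if not observed_times:
        raise ValueError("need at least one observed execution time")
    return predicted_wcet / max(observed_times)


def bounds_all_observations(predicted_wcet, observed_times):
    """True if the prediction is strictly above every observed time."""
    return all(predicted_wcet > t for t in observed_times)


# Hypothetical victim-kernel timings in milliseconds, measured while an
# enemy kernel runs alongside; these are not results from the paper.
observed_ms = [2.0, 2.5, 4.0]
predicted_ms = 10.0

factor = overestimation_factor(predicted_ms, observed_ms)
safe = bounds_all_observations(predicted_ms, observed_ms)
print(f"overestimation factor: {factor:.2f}x, bounds all observations: {safe}")
```

A factor close to 1 would indicate a tight estimate, while a factor below 1 would mean the prediction was exceeded by at least one measurement.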
