Abstract: Graphics processing unit (GPU) computing has become ubiquitous in embedded systems, as evidenced by its wide adoption for general-purpose applications. As more and more applications are accelerated by GPUs, multi-tasking scenarios are starting to emerge. Multi-tasking allows multiple applications to execute simultaneously on the same GPU and share its resources. This brings new challenges because the applications contend for shared resources such as caches. Moreover, GPU caches are difficult to use effectively: if used inappropriately, they may hurt performance instead of improving it. In this paper, we propose to use cache partitioning together with cache bypassing as the shared cache management mechanism for multi-tasking on GPUs. The combined approach aims to reduce interference among the tasks and preserve locality for each task. However, the interplay between cache partitioning and bypassing brings further challenges. On one hand, the cache space allocated to each task affects its cache bypassing decisions. On the other hand, cache bypassing affects the cache capacity required by each task. To address this, we propose a two-step approach. First, we use cache partitioning to assign dedicated cache space to each task, reducing interference among the tasks. During this process, we compare cache partitioning with coarse-grained cache bypassing. Then, we use fine-grained cache bypassing to selectively bypass certain data requests and threads for each task. We explore different cache partitioning and bypassing designs and demonstrate the potential benefits of this approach. Experiments using a wide range of applications demonstrate that our technique improves overall system throughput by 52% on average compared to the default multi-tasking solution on GPUs.
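
The two-step idea (partition first, then bypass selectively within each partition) can be illustrated with a toy model. The sketch below is not the mechanism evaluated in the paper; it is a minimal illustration that assumes a set-associative cache with way-based partitioning, LRU replacement inside each partition, and a caller-supplied bypass decision. The names `PartitionedCache`, `replay`, and `should_bypass` are hypothetical.

```python
from collections import OrderedDict


class PartitionedCache:
    """Toy shared cache in which each task owns a fixed number of ways per set.

    Assumption: way-based partitioning with LRU inside each partition.
    """

    def __init__(self, num_sets, ways_per_task):
        self.num_sets = num_sets
        self.ways = dict(ways_per_task)
        # One LRU-ordered dict per (task, set): tasks never evict each other.
        self.sets = {(t, s): OrderedDict()
                     for t in self.ways for s in range(num_sets)}
        self.stats = {t: {"hit": 0, "miss": 0, "bypass": 0} for t in self.ways}

    def access(self, task, block, bypass=False):
        # Step 2 (fine-grained bypassing): a bypassed request skips the cache
        # entirely. A task with zero ways is bypassed as a whole, which
        # corresponds to coarse-grained bypassing in this toy model.
        if bypass or self.ways[task] == 0:
            self.stats[task]["bypass"] += 1
            return "bypass"
        lines = self.sets[(task, block % self.num_sets)]
        if block in lines:
            lines.move_to_end(block)  # refresh LRU position
            self.stats[task]["hit"] += 1
            return "hit"
        # Step 1 (partitioning): evict only from this task's own ways.
        if len(lines) >= self.ways[task]:
            lines.popitem(last=False)  # drop the least recently used line
        lines[block] = True
        self.stats[task]["miss"] += 1
        return "miss"


def replay(trace, should_bypass, ways=2):
    """Replay a block-address trace for a single task 'A' and return its stats."""
    cache = PartitionedCache(num_sets=1, ways_per_task={"A": ways})
    for block in trace:
        cache.access("A", block, bypass=should_bypass(block))
    return cache.stats["A"]
```

For example, on the trace `[0, 1, 100, 0, 1, 101, 0, 1]` with two ways, `replay(trace, lambda b: False)` records no hits because the streaming blocks 100 and 101 keep evicting the reused blocks, while `replay(trace, lambda b: b >= 100)` bypasses the streaming blocks and records four hits. The same toy also shows the interplay mentioned above: how many ways a task receives changes which requests are worth bypassing, and bypassing changes how many ways the task needs.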

Exploring Cache Bypassing and Partitioning for Multi-Tasking on GPUs