Abstract: We propose overcoming the memory capacity limitation of GPUs with high-capacity Storage-Class Memory (SCM) and a DRAM cache. Because SCM significantly increases memory capacity, the GPU can hold a larger fraction of the memory footprint than with HBM for workloads that oversubscribe memory, achieving high speedups. However, the DRAM cache must be carefully designed to address the latency and bandwidth (BW) limitations of SCM while minimizing cost overhead and accounting for GPU characteristics. Because the massive number of GPU threads can thrash the DRAM cache, we first propose an SCM-aware DRAM cache bypass policy for GPUs. The policy considers the multi-dimensional characteristics of memory accesses by GPUs with SCM and bypasses the DRAM cache for data with low performance utility. In addition, to reduce DRAM cache probes and increase effective DRAM BW at minimal cost, we propose a Configurable Tag Cache (CTC) that repurposes part of the L2 cache to cache DRAM cacheline tags. Users can adjust the L2 capacity allocated to the CTC for adaptability. Furthermore, to minimize DRAM cache probe traffic from CTC misses, our Aggregated Metadata-In-Last-column (AMIL) DRAM cache organization co-locates all DRAM cacheline tags of a row in a single column within that row. Unlike the Tag-And-Data (TAD) organization of prior DRAM caches, AMIL also retains full ECC protection. Additionally, we propose throttling SCM to curtail power and exploiting SCM's SLC/MLC modes to adapt to a workload's memory footprint. While our techniques can be used with different DRAM and SCM devices, we focus on a Heterogeneous Memory Stack (HMS) organization that stacks SCM dies on top of DRAM dies for high performance. Compared to HBM, HMS improves performance by up to 12.5x (2.9x overall) and reduces energy by up to 89.3% (48.1% overall). Compared to prior works, we reduce DRAM cache probe traffic and SCM write traffic by 91-93% and 57-75%, respectively.
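
To illustrate why caching DRAM cacheline tags on chip can cut probe traffic, the following is a minimal toy model, not the paper's CTC or AMIL design. It assumes that a single probe fetches the tags of every cacheline in a DRAM row (loosely mirroring the idea of co-locating a row's tags in one column) and that, without a tag cache, every access needs its own probe. The names `TagCache`, `count_probes`, and `LINES_PER_ROW`, the LRU replacement policy, and all sizes are hypothetical.

```python
from collections import OrderedDict

LINES_PER_ROW = 32  # hypothetical number of cachelines per DRAM row


class TagCache:
    """Toy LRU cache holding the tag metadata of whole DRAM rows."""

    def __init__(self, capacity_rows):
        self.capacity_rows = capacity_rows
        self.rows = OrderedDict()  # row index -> placeholder for its tags

    def lookup(self, row):
        """Return True on a hit and refresh the row's LRU position."""
        if row in self.rows:
            self.rows.move_to_end(row)
            return True
        return False

    def fill(self, row):
        """Insert a row's tags, evicting the least recently used row if full."""
        self.rows[row] = True
        self.rows.move_to_end(row)
        if len(self.rows) > self.capacity_rows:
            self.rows.popitem(last=False)


def count_probes(line_addresses, capacity_rows):
    """Count DRAM tag probes, assuming one probe fetches all tags of a row."""
    cache = TagCache(capacity_rows)
    probes = 0
    for addr in line_addresses:
        row = addr // LINES_PER_ROW
        if not cache.lookup(row):
            probes += 1  # tag cache miss: probe DRAM for this row's tags
            cache.fill(row)
    return probes


# Two passes over 64 consecutive cachelines (two DRAM rows in this model).
trace = list(range(64)) * 2
probes_without_tag_cache = len(trace)  # assumed: one probe per access
probes_with_tag_cache = count_probes(trace, capacity_rows=4)
print(probes_without_tag_cache, probes_with_tag_cache)
```

In this model the probe count depends on how many rows the tag cache can hold relative to the rows the access stream touches, which is one way to picture why a user-adjustable tag cache capacity could matter.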

Bandwidth-Effective DRAM Cache for GPUs with Storage-Class Memory