Abstract: Modern graphics processing units (GPUs) deliver tremendous computing horsepower by running tens of thousands of threads concurrently. This massively parallel execution model has been effective at hiding the long latency of off-chip memory accesses in graphics and other general computing applications that exhibit regular memory behavior. With the fast-growing demand for general-purpose computing on GPUs (GPGPU), GPU workloads are becoming highly diversified and thus require synergistic coordination of computing and memory resources to unleash the computing power of GPUs. Accordingly, recent graphics processors have begun to integrate an on-die level-2 (L2) cache. The huge number of threads on GPUs, however, poses significant challenges to L2 cache design. Experiments on a variety of GPGPU applications reveal that the L2 cache may or may not improve overall performance, depending on the characteristics of the application. In this paper, we propose efficient techniques to improve GPGPU performance by orchestrating both the L2 cache and memory in a unified framework. The basic philosophy is to exploit the temporal locality among the massive number of concurrent memory requests and to minimize the impact of memory divergence among simultaneously executed groups of threads. Our major contributions are twofold. First, we propose a priority-based cache management scheme that maximizes the chance that frequently revisited data is kept in the cache. Second, we introduce a memory scheduling scheme that reorders memory requests in the memory controller according to their divergence behavior, reducing the average waiting time of warps. Simulation results show that our techniques improve overall performance by 10% on average for memory-intensive benchmarks, with a maximum gain of up to 30%.
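
To make the second contribution concrete, the toy model below illustrates why reordering requests by divergence can lower the average waiting time of warps. It is only a sketch under stated assumptions, not the scheduler described in the paper: the abstract does not give the policy, so the example assumes that every memory request takes one time unit, that a warp resumes only when its last request has been served, and that the controller serves warps with the fewest pending requests first. The names `average_warp_wait`, `fcfs_order`, `divergence_aware_order`, and the sample `queue` are hypothetical.

```python
from collections import defaultdict


def average_warp_wait(requests, order):
    """Average completion time over warps.

    requests[i] is the warp id that issued request i (assumed input format).
    Each request is assumed to take one time unit, and a warp is assumed
    to resume only after its last request is served.
    """
    finish = {}
    for t, idx in enumerate(order, start=1):
        finish[requests[idx]] = t  # overwritten until the warp's last request
    return sum(finish.values()) / len(finish)


def fcfs_order(requests):
    """Baseline: serve requests in arrival order."""
    return list(range(len(requests)))


def divergence_aware_order(requests):
    """Assumed policy: warps with fewer pending requests are served first."""
    counts = defaultdict(int)
    first_seen = {}
    for i, warp in enumerate(requests):
        counts[warp] += 1
        first_seen.setdefault(warp, i)
    # Sort by (pending requests of the warp, arrival of the warp, arrival of the request).
    return sorted(
        range(len(requests)),
        key=lambda i: (counts[requests[i]], first_seen[requests[i]], i),
    )


# Hypothetical queue: warp A is highly divergent (4 requests), B and C are not.
queue = ["A", "A", "A", "A", "B", "C", "C"]
fcfs = average_warp_wait(queue, fcfs_order(queue))
aware = average_warp_wait(queue, divergence_aware_order(queue))
print(f"FCFS: {fcfs:.2f}, divergence-aware: {aware:.2f}")
```

In this toy queue, arrival-order service completes the warps at times 4, 5, and 7, while the reordered service completes them at times 1, 3, and 7, so the average drops from 16/3 to 11/3 time units. A real memory controller would also have to account for row-buffer locality, bank parallelism, and starvation, which this sketch ignores.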

Properly Greedy Cache Prefetch Integrated Algorithm in the Parallel File System

PLC-cache: Endurable SSD Cache for Deduplication-Based Primary Storage

A readahead prefetcher for GPU file system layer

Improving Performance of Parallel I/O Systems Through Selective and Layout-Aware SSD Cache

An Application-Oriented Cache Allocation and Prefetching Method for Long-Running Applications in Distributed Storage Systems

S4D-Cache: Smart Selective SSD Cache for Parallel I/O Systems

Coordinated Page Prefetch and Eviction for Memory Oversubscription Management in GPUs

Optimization of software data prefetching in the IA-64 architecture

PCG: Mitigating Conflict-based Cache Side-channel Attacks with Prefetching

Re-Cache: Mitigating Cache Contention by Exploiting Locality Characteristics with Reconfigurable Memory Hierarchy for GPGPUs

iFetcher: User-Level Prefetching Framework With File-System Event Monitoring for Linux

GI Software with fewer Data Cache Misses

LWSDP: Locality-Aware Warp Scheduling and Dynamic Data Prefetching Co-design in the Per-SM Private Cache of GPGPUs

Orchestrating Cache Management and Memory Scheduling for GPGPU Applications

Caiti: I/O transit caching for persistent memory-based block device

Optimizing Parallel I/O Accesses Through Pattern-Directed and Layout-Aware Replication

Lookahead Cache with Instruction Processing Unit for Filling Memory Gap

Gaze into the Pattern: Characterizing Spatial Patterns with Internal Temporal Correlations for Hardware Prefetching

Improving reading performance by file prefetching mechanism in distributed cache systems

PARS: A Pattern-Aware Spatial Data Prefetcher Supporting Multiple Region Sizes

Exploring DRAM Cache Prefetching for Pooled Memory