Abstract: The general-purpose graphics processing unit (GPGPU), widely recognized as an exceptional computing platform for deploying emerging parallel applications, requires strict adherence to atomicity and memory consistency models for shared-variable synchronization. This is crucial to ensure deterministic execution and to leverage the performance advantages of the GPGPU single-instruction-multiple-threads architecture. However, the escalating demand for shared-variable updates across thread blocks, notably in applications such as deep neural networks and graph analysis, significantly exacerbates the serialization overhead of atomic operations due to the von Neumann bottleneck. Additionally, the overhead introduced by the memory fences that support the memory consistency model further complicates this fine-grained synchronization requirement. To address these challenges, this paper proposes Atomic Cache, an in-cache computing hardware-software co-design for GPGPUs. At the software level, we propose a relaxed memory consistency model based on non-ordering commutativity to ease the execution of in-cache atomic operations, thereby mitigating the performance overhead of memory fences. At the hardware level, we present the In-Situ Store Atomic Cache Macro, which enables the Atomic Cache to efficiently execute atomic logic and arithmetic operations within the cache array. This design alleviates the von Neumann bottleneck associated with the serialized execution of atomic operations. Experimental results demonstrate that the Atomic Cache saves more than 60% of memory access energy while incurring only 9.42% chip area overhead. Furthermore, it not only delivers an average speedup of 2.59× and an IPC performance improvement of 1.48× for RISC-V GPGPUs, but also achieves an average speedup of 1.31× and an IPC performance improvement of 39.92% compared with state-of-the-art designs employing local atomic buffers.
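The commutativity argument in the abstract can be illustrated with a minimal sketch. This is not the paper's mechanism: it is a plain sequential Python model, and the names `apply_updates`, `final_values`, `atomic_add`, `atomic_max`, and `atomic_exch` are hypothetical, chosen only for this illustration. It shows that when every update to a shared variable is commutative (add, max), the final value is the same for every ordering of the updates, so no ordering between them needs to be enforced, whereas an order-sensitive update (exchange) can end in different values.

```python
import itertools


def apply_updates(initial, updates, op):
    """Apply each update to a shared value in the given order."""
    value = initial
    for operand in updates:
        value = op(value, operand)
    return value


def final_values(initial, updates, op):
    """Collect the distinct final values over every ordering of the updates."""
    return {
        apply_updates(initial, order, op)
        for order in itertools.permutations(updates)
    }


# Hypothetical stand-ins for atomic read-modify-write operations.
def atomic_add(old, operand):
    return old + operand        # commutative


def atomic_max(old, operand):
    return max(old, operand)    # commutative


def atomic_exch(old, operand):
    return operand              # order-sensitive: the last writer wins


updates = [3, 5, 7]             # one operand per (simulated) thread
add_results = final_values(0, updates, atomic_add)
max_results = final_values(0, updates, atomic_max)
exch_results = final_values(0, updates, atomic_exch)

print(add_results)   # a single value: ordering does not matter
print(max_results)   # a single value: ordering does not matter
print(exch_results)  # several values: ordering matters
```

In this model, only the exchange case would need ordering guarantees between the updates to produce a deterministic result.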

Orchestrating Cache Management and Memory Scheduling for GPGPU Applications

Analyzing Memory Access on CPU-GPGPU Shared LLC Architecture

LWSDP: Locality-Aware Warp Scheduling and Dynamic Data Prefetching Co-design in the Per-SM Private Cache of GPGPUs

Re-Cache: Mitigating Cache Contention by Exploiting Locality Characteristics with Reconfigurable Memory Hierarchy for GPGPUs

Equidistant Memory Access Coalescing on GPGPU

Coordinated Static and Dynamic Cache Bypassing for GPUs

An Experimental GPU Global Memory Performance Estimation and Optimization

GPGPU Memory Estimation and Optimization Targeting OpenCL Architecture

Atomic Cache: Enabling Efficient Fine-Grained Synchronization with Relaxed Memory Consistency on GPGPUs Through In-Cache Atomic Operations

Optimizing Cache Bypassing and Warp Scheduling for GPUs

A perceptual and predictive batch-processing memory scheduling strategy for a CPU-GPU heterogeneous system

Two Methods for Combining Original Memory Access Coalescing and Equivalent Memory Access Coalescing on GPGPU

Thread Batching for High-performance Energy-efficient GPU Memory Design

Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures

Exploiting the Task-Pipelined Parallelism of Stream Programs on Many-Core GPUs

A CPU-GPGPU Scheduler Based on Data Transmission Bandwidth of Workload

Improving Multi-Application Concurrency Support Within the GPU Memory System

Buffer on Last Level Cache for CPU and GPGPU Data Sharing

Techniques for Shared Resource Management in Systems with Throughput Processors

A Framework for Memory Oversubscription Management in Graphics Processing Units

Exploring Time-Predictable and High-Performance Last-Level Caches for Hard Real-Time Integrated CPU-GPU Processors