Abstract: General-purpose graphics processing units (GPGPUs), widely recognized as an exceptional computing platform for deploying emerging parallel applications, require strict adherence to atomicity and memory consistency models for shared-variable synchronization. This adherence is crucial to ensure deterministic execution and to leverage the performance advantages of the GPGPU single-instruction-multiple-threads architecture. However, the escalating demand for shared-variable updates across thread blocks, notably in applications such as deep neural networks and graph analysis, significantly exacerbates the serialization overhead of atomic operations caused by the von Neumann bottleneck. In addition, the overhead of the memory fences that support the memory consistency model further complicates this fine-grained synchronization requirement. To address these challenges, this paper proposes Atomic Cache, an in-cache computing hardware-software co-design for GPGPUs. At the software level, we propose a relaxed memory consistency model based on non-ordering commutativity to ease the execution of in-cache atomic operations, thereby mitigating the performance overhead of memory fences. At the hardware level, we present the In-Situ Store Atomic Cache Macro, which enables the Atomic Cache to execute atomic logic and arithmetic operations efficiently within the cache array. This design alleviates the von Neumann bottleneck associated with the serialized execution of atomic operations. Experimental evaluation shows that the Atomic Cache saves more than 60% of memory access energy while incurring only 9.42% chip area overhead. Furthermore, it delivers an average speedup of 2.59× and an IPC improvement of 1.48× for RISC-V GPGPUs, and it achieves an average speedup of 1.31× and an IPC improvement of 39.92% over state-of-the-art designs that employ local atomic buffers.
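
The idea behind "non-ordering commutativity" can be illustrated with a small sketch. The Python model below is an assumption made for illustration only; it is not the paper's consistency model or hardware, and names such as `COMMUTATIVE_OPS`, `final_values`, and `is_order_insensitive` are hypothetical. It applies a set of atomic updates to one shared variable in every possible order and checks whether the final stored value depends on that order. Commutative and associative updates (add, min, max, and, or, xor) always reach the same final value, which is why their relative ordering may not need to be enforced, whereas an exchange does not. The sketch covers only the final memory state; the old value returned to each thread by an atomic would still depend on the order.

```python
import itertools
import operator

# Hypothetical model: each "atomic" is a binary function applied to one
# shared variable. The names here are illustrative, not taken from the paper.
COMMUTATIVE_OPS = {
    "add": operator.add,
    "min": min,
    "max": max,
    "and": operator.and_,
    "or": operator.or_,
    "xor": operator.xor,
}


def exchange(old, new):
    """Atomic exchange: the stored value is simply overwritten."""
    return new


def final_values(op, initial, operands):
    """Apply op under every ordering of operands; return the set of final values."""
    results = set()
    for order in itertools.permutations(operands):
        value = initial
        for x in order:
            value = op(value, x)
        results.add(value)
    return results


def is_order_insensitive(op, initial, operands):
    """True if every ordering of the updates leaves the same final value."""
    return len(final_values(op, initial, operands)) == 1


if __name__ == "__main__":
    operands = [3, 5, 9, 12]
    for name, op in COMMUTATIVE_OPS.items():
        print(name, is_order_insensitive(op, 7, operands))
    print("exchange", is_order_insensitive(exchange, 7, operands))
```

In this toy model, the updates whose final value is order-insensitive are the kind that could be performed without fences between them, while order-sensitive ones such as exchange would still need ordering guarantees.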

An Efficient Compiler Framework for Cache Bypassing on GPUs

Coordinated Static and Dynamic Cache Bypassing for GPUs

Optimizing Cache Bypassing and Warp Scheduling for GPUs

Exploring Cache Bypassing and Partitioning for Multi-Tasking on GPUs

Orchestrating Cache Management and Memory Scheduling for GPGPU Applications

A Compiler-assisted Locality Aware CTA Mapping Scheme

Re-Cache: Mitigating Cache Contention by Exploiting Locality Characteristics with Reconfigurable Memory Hierarchy for GPGPUs

Efficient Kernel Management on GPUs

Statistical Cache Bypassing for Non-Volatile Memory

Atomic Cache: Enabling Efficient Fine-Grained Synchronization with Relaxed Memory Consistency on GPGPUs Through In-Cache Atomic Operations

Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures

HBPB, Applying Reuse Distance to Improve Cache Efficiency Proactively

A Hybrid Approach to Cache Management in Heterogeneous CPU-FPGA Platforms

Understanding the GPU Microarchitecture to Achieve Bare-Metal Performance Tuning

Exploiting the Task-Pipelined Parallelism of Stream Programs on Many-Core GPUs

GPU First -- Execution of Legacy CPU Codes on GPUs

Fleche: an efficient GPU embedding cache for personalized recommendations

ICCAD: U: Optimizing GPU Shared Memory Allocation in Automated C-to-CUDA Compilation

Explicit caching HYB: a new high-performance SpMV framework on GPGPU

Conflict-aware compiler for hierarchical register file on GPUs

TensorCache: Reconstructing Memory Architecture with SRAM-Based In-Cache Computing for Efficient Tensor Computations in GPGPUs