Abstract: The general-purpose graphics processing unit (GPGPU), widely recognized as an exceptional computing platform for deploying emerging parallel applications, requires strict adherence to atomicity and memory consistency models for shared variable synchronization. This is crucial to ensure deterministic execution and to leverage the performance advantages of the GPGPU single-instruction-multiple-threads architecture. However, the escalating demand for shared variable updates across thread blocks, notably in applications such as deep neural networks and graph analysis, significantly exacerbates the serialization overhead of atomic operations due to the von Neumann bottleneck. Additionally, the overhead introduced by the memory fences that support the memory consistency model further complicates this fine-grained synchronization requirement. To address these challenges, this paper proposes Atomic Cache, an in-cache computing hardware-software co-design for GPGPUs. At the software level, we propose a relaxed memory consistency model based on non-ordering commutativity to ease the execution of in-cache atomic operations, thereby mitigating the performance overhead of memory fences. At the hardware level, we present the In-Situ Store Atomic Cache Macro, which enables the Atomic Cache to efficiently execute atomic logic and arithmetic operations within the cache array. This alleviates the von Neumann bottleneck associated with the serialized execution of atomic operations. The experimental evaluation shows that the Atomic Cache can save more than 60% of memory access energy while incurring only 9.42% chip area overhead. Furthermore, it not only delivers an average speedup ratio of 2.59× and an IPC performance improvement of 1.48× for RISC-V GPGPUs, but also achieves an average speedup ratio of 1.31× and an IPC performance improvement of 39.92% when compared to state-of-the-art designs employing local atomic buffers.
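The idea of relaxing ordering for commutative atomic updates can be illustrated with a small CPU-side analogy. The sketch below is not the paper's hardware design or its GPU programming model; it is a minimal C++ example, under the assumption that the shared variable is a plain counter and that nothing else is published through it. The function name `relaxed_sum` and its parameters are hypothetical. Because addition is commutative, each update only needs to be atomic, not ordered relative to other memory accesses, so no fence is required between updates.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper: num_threads workers each add 1 to a shared counter
// per_thread times. memory_order_relaxed keeps every increment atomic but
// imposes no ordering on surrounding accesses. That is sufficient here
// because the increments commute: any interleaving yields the same total.
std::uint64_t relaxed_sum(unsigned num_threads, std::uint64_t per_thread) {
    assert(num_threads > 0 && "need at least one worker");

    std::atomic<std::uint64_t> counter{0};
    std::vector<std::thread> workers;
    workers.reserve(num_threads);

    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counter, per_thread] {
            for (std::uint64_t i = 0; i < per_thread; ++i) {
                // Atomic read-modify-write with no fence semantics.
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }

    // join() synchronizes with the end of each worker, so the final
    // load below observes every increment.
    for (auto& w : workers) {
        w.join();
    }
    return counter.load(std::memory_order_relaxed);
}
```

Calling `relaxed_sum(4, 10000)` returns the same total regardless of how the workers interleave. The same reasoning would not hold for non-commutative updates, or when the counter is used to signal that other data is ready; those cases still need stronger ordering.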

Using GPU to accelerate a pin-based multi-level cache simulator

Using GPU to Accelerate Cache Simulation

GPU Accelerating for Rapid Multi-core Cache Simulation

GCSim: A GPU-Based Trace-Driven Simulator for Multi-level Cache

Cache simulator based on GPU acceleration

Accelerate Cache Simulation with Generic GPU

GPU-based time parallel cache simulator

Orchestrating Cache Management and Memory Scheduling for GPGPU Applications

Hardware/Software Co-Simulation for Last Level Cache Exploration

Acceleration for Timing-Aware Gate-Level Logic Simulation with One-Pass GPU Parallelism

An Analytical Approach for Fast and Accurate Design Space Exploration of Instruction Caches

An approach to accessing unified memory address space of heterogeneous kilo-cores system

Locality-protected Cache Allocation Scheme with Low Overhead on GPUs

Exploiting Parallelism in the Simulation of General Purpose Graphics Processing Unit Program

Locality-Driven Dynamic GPU Cache Bypassing

Accelerating GPGPU Architecture Simulation

Advanced hybrid MRAM based novel GPU cache system for graphic processing with high efficiency

Adaptive Cache Management for Energy-Efficient GPU Computing

Atomic Cache: Enabling Efficient Fine-Grained Synchronization with Relaxed Memory Consistency on GPGPUs Through In-Cache Atomic Operations

Cache-emulated Register File: an Integrated On-Chip Memory Architecture for High Performance GPGPUs

GPGPU-MiniBench: Accelerating GPGPU Micro-Architecture Simulation