Abstract: The general-purpose graphics processing unit (GPGPU), widely recognized as an exceptional computing platform for deploying emerging parallel applications, requires strict adherence to atomicity and memory consistency models for shared variable synchronization. This is crucial to ensure deterministic execution and to leverage the performance advantages of the GPGPU single-instruction-multiple-threads architecture. However, the escalating demand for shared variable updates across thread blocks, notably in applications such as deep neural networks and graph analysis, significantly exacerbates the serialization overhead of atomic operations due to the von Neumann bottleneck. In addition, the overhead of the memory fences that support the memory consistency model further complicates this fine-grained synchronization requirement. To address these challenges, this paper proposes Atomic Cache, an in-cache computing hardware-software co-design for GPGPUs. At the software level, we propose a relaxed memory consistency model based on non-ordering commutativity to ease the execution of in-cache atomic operations, thereby mitigating the performance overhead of memory fences. At the hardware level, we present the In-Situ Store Atomic Cache Macro, which enables the Atomic Cache to efficiently execute atomic logic and arithmetic operations within the cache array. This alleviates the von Neumann bottleneck associated with the serialized execution of atomic operations. Experimental results show that the Atomic Cache saves more than 60% of memory access energy while incurring only 9.42% chip area overhead. Furthermore, it delivers an average speedup of 2.59× and an IPC improvement of 1.48× for RISC-V GPGPUs, and it achieves an average speedup of 1.31× and an IPC improvement of 39.92% compared with state-of-the-art designs employing local atomic buffers.
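The commutativity argument in the abstract can be illustrated with a small host-side sketch. It is not the paper's hardware mechanism or its consistency model: it is a single-threaded C++ analogy, and the names `Op` and `final_values_over_all_orders` are hypothetical. It applies the same operands to an atomic variable in every possible order. For a commutative update (add) the final value does not depend on the order, which is the intuition for why such updates may not need fence-enforced ordering among themselves. For a non-commutative update (exchange) the final value does depend on the order.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <set>
#include <vector>

// Hypothetical operation kinds used only for this illustration.
enum class Op { Add, Exchange };

// Applies `operands` to a zero-initialized atomic in every possible order and
// returns the set of distinct final values that were observed.
std::set<int> final_values_over_all_orders(Op op, std::vector<int> operands) {
    assert(!operands.empty());
    // std::next_permutation enumerates all orders only from a sorted start.
    std::sort(operands.begin(), operands.end());
    std::set<int> results;
    do {
        std::atomic<int> shared{0};
        for (int v : operands) {
            if (op == Op::Add) {
                // Commutative read-modify-write; relaxed ordering requests
                // atomicity only, with no ordering against other accesses.
                shared.fetch_add(v, std::memory_order_relaxed);
            } else {
                // Non-commutative: the last writer determines the result.
                shared.exchange(v, std::memory_order_relaxed);
            }
        }
        results.insert(shared.load(std::memory_order_relaxed));
    } while (std::next_permutation(operands.begin(), operands.end()));
    return results;
}
```

Calling `final_values_over_all_orders(Op::Add, {1, 2, 3})` yields a single final value, while the same call with `Op::Exchange` yields one value per possible last operand. On a real GPGPU the updates would come from concurrent threads in different thread blocks; the sketch only isolates the order-independence property.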

Optimizing Cache Bypassing and Warp Scheduling for GPUs

Coordinated Static and Dynamic Cache Bypassing for GPUs

An Efficient Compiler Framework for Cache Bypassing on GPUs

Exploring Cache Bypassing and Partitioning for Multi-Tasking on GPUs

Orchestrating Cache Management and Memory Scheduling for GPGPU Applications

LWSDP: Locality-Aware Warp Scheduling and Dynamic Data Prefetching Co-design in the Per-SM Private Cache of GPGPUs

Efficient Kernel Management on GPUs

Advanced hybrid MRAM based novel GPU cache system for graphic processing with high efficiency

GPU Lock-Free Hopscotch Hash Table

Exploiting the Task-Pipelined Parallelism of Stream Programs on Many-Core GPUs

GPGPU Memory Estimation and Optimization Targeting OpenCL Architecture

Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures

Agglomerative Memory and Thread Scheduling for High-Performance Ray-Tracing on GPUs

Re-Cache: Mitigating Cache Contention by Exploiting Locality Characteristics with Reconfigurable Memory Hierarchy for GPGPUs

Bandwidth-Effective DRAM Cache for GPUs with Storage-Class Memory

Generalized GPU Acceleration for Applications Employing Finite-Volume Methods

An Experimental GPU Global Memory Performance Estimation and Optimization

Efficient GPU Spatial-Temporal Multitasking

TensorCache: Reconstructing Memory Architecture with SRAM-Based In-Cache Computing for Efficient Tensor Computations in GPGPUs

Atomic Cache: Enabling Efficient Fine-Grained Synchronization with Relaxed Memory Consistency on GPGPUs Through In-Cache Atomic Operations

Performance Evaluation and Optimization of HBM-Enabled GPU for Data-Intensive Applications