Atomic Cache: Enabling Efficient Fine-Grained Synchronization with Relaxed Memory Consistency on GPGPUs Through In-Cache Atomic Operations

Yicong Zhang,Mingyu Wang,Wangguang Wang,Yangzhan Mai,Haiqiu Huang,Zhiyi Yu
DOI: https://doi.org/10.1109/micro61859.2024.00056
2024-01-01
Abstract:General-purpose graphics processing unit (GPGPU), widely recognized as an exceptional computing platform for de-ploying emerging parallel applications, requires strict adherence to atomicity and memory consistency models for shared variable synchronization. This is crucial to ensure deterministic execution and leverage the performance advantages of the GPGPU single-instruction -multiple-threads architecture. However, the escalating demand for shared variable updates across thread blocks, notably in applications like deep neural networks and graph analysis, significantly exacerbates the serialization overhead of atomic operations due to the von Neumann bottleneck. Additionally, the overhead introduced by memory fences supporting the memory consistency model further complicates this fine-grained synchronization requirement. To address these challenges, this paper proposes Atomic Cache, facilitating an In-Cache computing hardware-software co-design for GPGPUs. At the software level, we propose relaxed memory consistency based on non-ordering commutativity to alleviate the execution of in-cache atomic operations, thereby mitigating the performance overhead of memory fences. At the hardware level, we present the In-Situ Store Atomic Cache Macro, which empowers the Atomic Cache to efficiently execute atomic logic and arithmetic operations within the cache array. This innovation alleviates the von Neumann bottleneck associated with serialized execution of atomic operations. The experimental evaluation results demonstrate that the Atomic Cache can save more than 60% of memory access energy while incurring only 9.42% chip area overhead. Furthermore, it not only delivers an average speedup ratio of 2.59 × and an IPC performance improvement of 1.48× for RISC-V GPGPUs, but also achieves an average speedup ratio of 1.31 × and an IPC performance improvement of 39.92% when compared to state-of-the-art designs employing local atomic buffers.
What problem does this paper attempt to address?