TensorCache: Reconstructing Memory Architecture with SRAM-Based In-Cache Computing for Efficient Tensor Computations in GPGPUs

Yicong Zhang, Mingyu Wang, Yangzhan Mai, Zhiyi Yu
DOI: https://doi.org/10.1109/tvlsi.2023.3326741
2023-01-01
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Abstract: General-purpose graphics processing units (GPGPUs) have emerged as a pivotal computing platform for deep learning applications. However, the fundamental tensor computations for neural networks on GPGPUs are still restricted by the von Neumann bottleneck. The memory bandwidth and energy consumed in moving large amounts of neural network data between the memory hierarchy and the computational units of GPGPUs dominate the overall computational cost. To address these challenges, this article proposes TensorCache, which reconstructs the memory architecture with static random-access memory (SRAM)-based In-Cache Computing for efficient tensor computations in GPGPUs. It provides a digital SRAM processing-in-memory (PIM) solution that transforms the cache array into large-scale PIM units, mitigating the significant performance and energy losses caused by data movement. To enable efficient hardware-software co-design for TensorCache, a decoupled architecture-based SRAM-PIM macro (SPM) is introduced at the hardware level, supporting in-memory bit-parallel comparison (IMBC) and a near-memory radix-4 Booth encoder (NRBE) for efficient mixed-precision floating-point (FP) tensor computations. At the software level, a programming model leveraging the GPGPU's flexible programmability is proposed to bridge the gap between application demands and mismatched hardware/software interfaces. Experimental evaluations demonstrate that TensorCache achieves up to $38.59\times$ speedup and $16.26\times$ throughput enhancement compared to GPU CUDA Cores. Furthermore, it attains up to $1.78\times$ acceleration and $3.87\times$ throughput improvement compared to GPU Tensor Cores, while reducing power consumption in tensor computations by over 90% with a 21% chip area overhead.
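The abstract names a near-memory radix-4 Booth encoder (NRBE) without describing it. As background only, the following is a minimal software sketch of textbook radix-4 Booth recoding, which halves the number of partial products in a multiplication by mapping overlapping 3-bit windows of the multiplier to digits in {-2, -1, 0, 1, 2}. It is not the paper's circuit: the function names `booth_radix4_digits` and `booth_multiply` are hypothetical, and the sketch assumes a signed two's-complement multiplier with an even bit width.

```python
def booth_radix4_digits(multiplier: int, bits: int) -> list[int]:
    """Recode a signed two's-complement value into radix-4 Booth digits.

    Digits are returned least significant first, each in {-2, -1, 0, 1, 2}.
    `bits` is the assumed operand width and must be even.
    """
    if bits <= 0 or bits % 2 != 0:
        raise ValueError("bits must be a positive even number")
    if not -(1 << (bits - 1)) <= multiplier < (1 << (bits - 1)):
        raise ValueError("multiplier does not fit in the given width")

    # Two's-complement bit pattern with the implicit bit y[-1] = 0 appended.
    pattern = (multiplier & ((1 << bits) - 1)) << 1

    digits = []
    for i in range(bits // 2):
        window = (pattern >> (2 * i)) & 0b111  # bits y[2i+1], y[2i], y[2i-1]
        y_hi = (window >> 2) & 1
        y_mid = (window >> 1) & 1
        y_lo = window & 1
        digits.append(-2 * y_hi + y_mid + y_lo)
    return digits


def booth_multiply(multiplicand: int, multiplier: int, bits: int) -> int:
    """Multiply by summing one shifted partial product per Booth digit."""
    digits = booth_radix4_digits(multiplier, bits)
    # Each digit selects 0, +/-1x, or +/-2x the multiplicand, weighted by 4**i.
    return sum((d * multiplicand) << (2 * i) for i, d in enumerate(digits))
```

For an 8-bit multiplier this produces four partial products instead of eight, which is the usual motivation for placing such an encoder next to a multiplier array.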