ICCAD: U: Optimizing GPU Shared Memory Allocation in Automated C-to-CUDA Compilation

Xinfeng Xie, J. Cong, Yun Liang
2018-01-01
Abstract: To ease the burden of GPGPU programming, several existing frameworks generate efficient CUDA code from C code with user-provided directives. In automated shared memory allocation, which is key to CUDA kernel performance, previous work mainly explores different data reuse schemes to minimize the number of global memory transactions. However, our study shows that this intuitive model is inaccurate because it omits the impact of shared memory allocation on both parallelism and locality, which in turn affects kernel performance. Allocating too much shared memory improves data locality by saving global memory transactions and reducing cache contention, but it limits thread-level parallelism (TLP). Conversely, allocating too little shared memory benefits TLP but hurts data locality. Based on these observations, we develop a performance model that systematically accounts for the impact of shared memory allocation on performance. Furthermore, we implement a C-to-CUDA compilation framework that optimizes shared memory allocation and automatically generates CUDA code according to the proposed performance model.
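The trade-off between shared memory allocation and TLP described in the abstract can be illustrated with a simple residency calculation. The sketch below is not the paper's performance model; it only shows how the shared memory requested per thread block can cap the number of blocks (and hence threads) resident on one streaming multiprocessor (SM). The function names and the hardware limits used as defaults (48 KB of shared memory, 1536 threads, and 8 blocks per SM) are illustrative assumptions, and the calculation ignores register pressure and allocation granularity.

```python
# Minimal sketch: how shared memory per block bounds thread-level parallelism.
# All hardware limits below are hypothetical defaults chosen for illustration;
# real values depend on the GPU's compute capability.

def resident_blocks_per_sm(smem_per_block, threads_per_block,
                           smem_per_sm=48 * 1024,
                           max_threads_per_sm=1536,
                           max_blocks_per_sm=8):
    """Upper bound on thread blocks that can be resident on one SM."""
    # Limit imposed by the thread capacity of the SM.
    limit_by_threads = max_threads_per_sm // threads_per_block
    # Limit imposed by shared memory; no limit if the block uses none.
    if smem_per_block == 0:
        limit_by_smem = max_blocks_per_sm
    else:
        limit_by_smem = smem_per_sm // smem_per_block
    return min(max_blocks_per_sm, limit_by_threads, limit_by_smem)


def resident_threads_per_sm(smem_per_block, threads_per_block, **limits):
    """Resident threads per SM, a rough proxy for TLP."""
    blocks = resident_blocks_per_sm(smem_per_block, threads_per_block, **limits)
    return blocks * threads_per_block


if __name__ == "__main__":
    # Larger shared memory tiles leave room for fewer resident blocks.
    for smem_kb in (4, 16, 32):
        threads = resident_threads_per_sm(smem_kb * 1024, 256)
        print(f"{smem_kb:2d} KB per block -> {threads} resident threads per SM")
```

Under these assumed limits, a 256-thread block that requests 4 KB of shared memory is bounded by the thread limit, while a request of 16 KB or 32 KB makes shared memory the binding constraint and lowers the resident thread count. A larger tile may still pay off if the saved global memory transactions outweigh the lost parallelism, which is the balance the abstract says the performance model is meant to capture.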