Adaptive Cache and Concurrency Allocation on GPGPUs

Zhong Zheng,Zhiying Wang,Mikko Lipasti
DOI: https://doi.org/10.1109/lca.2014.2359882
IF: 2.3
2015-01-01
IEEE Computer Architecture Letters
Abstract:Memory bandwidth is critical to GPGPU performance. Exploiting locality in caches can better utilize memory bandwidth. However, memory requests issued by excessive threads cause cache thrashing and saturate memory bandwidth, degrading performance. In this paper, we propose adaptive cache and concurrency allocation (CCA) to prevent cache thrashing and improve the utilization of bandwidth and computational resources, hence improving performance. According to locality and reuse distance of access patterns in GPGPU program, warps on a stream multiprocessor are dynamically divided into three groups: cached, bypassed, and waiting. The data cache accommodates the footprint of cached warps. Bypassed warps cannot allocate cache lines in the data cache to prevent cache thrashing, but are able to take advantage of available memory bandwidth and computational resource. Waiting warps are de-scheduled. Experimental results show that adaptive CCA can significant improve benchmark performance, with 80 percent harmonic mean IPC improvement over the baseline.
What problem does this paper attempt to address?