A model-driven approach to warp/thread-block level GPU cache bypassing.

Hongwen Dai, Chao Li, Huiyang Zhou, Saurabh Gupta, Christos Kartsaklis, Mike Mantor
DOI: https://doi.org/10.1145/2897937.2897966
2016-01-01
Abstract: The large number of memory requests issued by massive numbers of threads can easily cause cache contention and cache-miss-related resource congestion on GPUs. This paper proposes a simple yet effective performance model that estimates the impact of cache contention and resource congestion as a function of the number of warps/thread blocks (TBs) that bypass the cache. We then design a hardware-based dynamic warp/thread-block level GPU cache bypassing scheme, which achieves a 1.68x speedup on average over the baseline on a set of memory-intensive benchmarks. Compared to prior work, our scheme achieves a 21.6% performance improvement over SWL-best [29] and 11.9% over CBWT-best [4] on average.