CPU-assisted GPU thread pool model for dynamic task parallelism

Shuai Zhang, Tao Li, Qiankun Dong, Xuechen Liu, Yulu Yang
DOI: https://doi.org/10.1109/NAS.2015.7255234
2015-01-01
Abstract: With the growing power of GPUs, how to utilize the high computing performance provided by GPU hardware has become an urgent yet challenging problem, especially for applications with fine-grained parallelism. Task programming handles fine-grained parallelism efficiently, but current GPU task-parallel solutions, which use either concurrent kernel execution (CKE) or persistent kernels, suffer from a high cost of CPU-GPU interaction. The page-locked host memory supported by new-generation GPUs turns CPU-GPU heterogeneous systems into a non-uniform memory access (NUMA) architecture, making it possible to improve CPU-GPU interaction through shared-memory programming. In this paper, we propose the CPU-assisted GPU thread pool (CAGTP) model, which combines data parallelism and task parallelism at the thread-block level to support applications with fine-grained parallelism. Within the CAGTP model, we design the Computing Block Level task Scheduling (CBLS) method, in which task slots allocated in page-locked host memory eliminate competition among thread blocks. A separate host scheduler assigns tasks to thread blocks, and the overhead of scheduling a task (200 ns) is much lower than that of similar systems. Experimental results show that the CAGTP model efficiently supports fine-grained task parallelism with or without dependencies. It outperforms CKE for batched GEMMs, Cholesky factorization, and mixed workloads.
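The slot-based scheduling idea in the abstract can be illustrated with a small CPU-only analogy. This is a minimal sketch under stated assumptions, not the authors' implementation: ordinary C++ threads stand in for GPU thread blocks, a heap-allocated array stands in for page-locked host memory, and the names `TaskSlot`, `worker`, and `run_pool` are hypothetical. It shows only the property the abstract emphasizes: each worker polls its own slot, so workers never compete with one another, and a single host-side scheduler is the only writer of new tasks.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical slot states (an assumption for this sketch).
constexpr int kEmpty = 0, kReady = 1, kStop = 2;

// One slot per worker; the worker thread stands in for a GPU thread block.
struct TaskSlot {
    std::atomic<int> state{kEmpty};
    int task_id = -1;
};

// Worker loop: reads only its own slot, so there is no contention
// between workers for tasks.
void worker(TaskSlot& slot, std::vector<int>& done_by, int self) {
    for (;;) {
        int s = slot.state.load(std::memory_order_acquire);
        if (s == kStop) return;
        if (s == kReady) {
            done_by[slot.task_id] = self;  // stand-in for the real task body
            slot.state.store(kEmpty, std::memory_order_release);
        } else {
            std::this_thread::yield();
        }
    }
}

// Host scheduler: the single producer that fills empty slots with tasks.
// Returns, for each task, the index of the worker that executed it.
std::vector<int> run_pool(int num_workers, int num_tasks) {
    assert(num_workers > 0 && num_tasks >= 0);
    std::vector<TaskSlot> slots(num_workers);
    std::vector<int> done_by(num_tasks, -1);
    std::vector<std::thread> pool;
    for (int w = 0; w < num_workers; ++w)
        pool.emplace_back(worker, std::ref(slots[w]), std::ref(done_by), w);

    int next = 0;
    while (next < num_tasks) {
        for (int w = 0; w < num_workers && next < num_tasks; ++w) {
            if (slots[w].state.load(std::memory_order_acquire) == kEmpty) {
                slots[w].task_id = next++;
                slots[w].state.store(kReady, std::memory_order_release);
            }
        }
        std::this_thread::yield();
    }

    // Wait for each slot to drain, then tell its worker to exit.
    for (int w = 0; w < num_workers; ++w) {
        while (slots[w].state.load(std::memory_order_acquire) != kEmpty)
            std::this_thread::yield();
        slots[w].state.store(kStop, std::memory_order_release);
    }
    for (auto& t : pool) t.join();
    return done_by;
}
```

Calling `run_pool(4, 32)` dispatches 32 tasks across 4 workers and returns which worker ran each task. On a real GPU the slots would have to live in mapped page-locked memory and the polling would run inside a persistent kernel, which this sketch does not attempt to model.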