The Unicorn Runtime: Efficient Distributed Shared Memory Programming for Hybrid CPU-GPU Clusters
Tarun Beri,Sorav Bansal,Subodh Kumar
DOI: https://doi.org/10.1109/tpds.2016.2616314
IF: 5.3
2017-05-01
IEEE Transactions on Parallel and Distributed Systems
Abstract:Programming hybrid CPU-GPU clusters is hard. This paper addresses this difficulty and presents the design and runtime implementation of Unicorn-a parallel programming model for hybrid CPU-GPU clusters. In particular, this paper proves that efficient distributed shared memory style programing is possible and its simplicity can be retained across CPUs and GPUs in a cluster, minus the frustration of dealing with race conditions. Further, this can be done with a unified abstraction, avoiding much of the complication of dealing with hybrid architectures. This is achieved with the help of transactional semantics (on shared global address spaces), deferred bulk data synchronization, workload pipelining and various communication and computation scheduling optimizations. We describe the said abstraction, our computation and communication scheduling system and report its performance on a few benchmarks like Matrix Multiplication, LU Decomposition and 2D FFT. We find that parallelization of coarse-grained applications like matrix multiplication or 2D FFT using our system requires only about 30 lines of C code to set up the runtime. The rest of the application code is regular single CPU/GPU implementation. This indicates the ease of extending parallel code to a distributed environment. The execution is efficient as well. When multiplying two square matrices of size 65, 536 χ 65,536, Unicornachieves a peak performance of 7.88 TFlop/s when run over a cluster of 14 nodes with each node equipped with two Tesla M2070 GPUs and two 6-core Intel Xeon 2.67 GHz CPUs, connected over a 32Gbps Infiniband network. In this paper, we also demonstrate that the Unicorn programming model can be efficiently used to implement high level abstractions like MapReduce. We use such an extension to implement PageRank and report its performance. For a sample web of 500 million web pages, our implementation completes a page rank iteration in about 18 seconds (on average) on a 14-node cluster.
computer science, theory & methods,engineering, electrical & electronic