AEML: An Acceleration Engine for Multi-GPU Load-balancing in Distributed Heterogeneous Environment

Zhuo Tang, Lifan Du, Xuedong Zhang, Li Yang, Kenli Li
DOI: https://doi.org/10.1109/tc.2021.3084407
IF: 3.183
2021-01-01
IEEE Transactions on Computers
Abstract: To meet the rapidly growing computation requirements of big data and artificial intelligence, CPU-GPU heterogeneous clusters can provide more computing capacity than CPU clusters. The highly parallel computing capabilities of GPUs greatly accelerate computation-intensive applications. Moreover, the number of GPUs on a single computing node is scalable, which greatly improves the computing capacity of a cluster whose size is limited. However, an effective load-balancing scheduling model for multi-GPU hardware environments is lacking. This article proposes AEML, an acceleration engine for multi-GPU load balancing in distributed heterogeneous environments. AEML effectively integrates GPUs into a distributed processing framework and achieves good load balance among multiple heterogeneous GPUs. We propose a heterogeneous task execution model based on multiple GPUs and multiple streams (MGMS), which effectively balances the workload across multiple GPUs. The MGMS model uses four core techniques: a fine-grained task mapping mechanism, a unified device resource management scheme, a novel resource-aware GPU task scheduling strategy, and a feedback-based stream adjustment scheme. The AEML system is implemented on Spark 3.0.0 and NVIDIA CUDA 10.0. We comprehensively evaluate the performance of AEML with multiple typical benchmarks. Experimental results show that AEML can fully exploit the computing power of GPUs and achieve good load balance among multiple heterogeneous GPUs.
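The abstract names a resource-aware GPU task scheduling strategy and a feedback-based stream adjustment scheme but does not give their details. The following is only a minimal sketch of what such ideas could look like, not the authors' algorithm: it assumes a greedy "earliest estimated finish time" placement and a simple threshold rule for the stream count. Every name, field, and threshold here (Gpu, throughput, memory, schedule, adjust_streams, low, high, max_streams) is hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Gpu:
    # All fields are hypothetical; the paper's resource model is not given here.
    name: str
    throughput: float        # assumed relative speed, in work units per second
    memory: int              # assumed device memory capacity, arbitrary units
    busy_until: float = 0.0  # estimated time at which assigned work finishes
    tasks: list = field(default_factory=list)


def schedule(tasks, gpus):
    """Greedy placement: each task goes to the GPU that would finish it
    earliest, considering only GPUs whose memory can hold the task.

    tasks is a list of (task_id, work, memory_needed) tuples.
    """
    # Place the largest tasks first so small ones can fill the gaps.
    for task_id, work, mem in sorted(tasks, key=lambda t: -t[1]):
        candidates = [g for g in gpus if g.memory >= mem]
        if not candidates:
            raise RuntimeError(f"no GPU can hold task {task_id}")
        best = min(candidates, key=lambda g: g.busy_until + work / g.throughput)
        best.busy_until += work / best.throughput
        best.tasks.append(task_id)
    return gpus


def adjust_streams(current, utilization, low=0.6, high=0.9, max_streams=8):
    """Assumed feedback rule: add a stream when the device is underused,
    remove one when it is saturated, otherwise keep the count unchanged."""
    if utilization < low and current < max_streams:
        return current + 1
    if utilization > high and current > 1:
        return current - 1
    return current


fast = Gpu("fast", throughput=2.0, memory=8)
slow = Gpu("slow", throughput=1.0, memory=8)
schedule([("a", 4, 1), ("b", 4, 1), ("c", 4, 1)], [fast, slow])
```

In this toy run the faster device receives two of the three equal tasks and the slower device one, so both are estimated to finish at the same time. A real system would replace the assumed throughput and utilization figures with measurements taken at run time.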
Engineering, Electrical & Electronic; Computer Science, Hardware & Architecture