iMLBench: A Machine Learning Benchmark Suite for CPU-GPU Integrated Architectures

Chenyang Zhang, Feng Zhang, Xiaoguang Guo, Bingsheng He, Xiao Zhang, Xiaoyong Du
DOI: https://doi.org/10.1109/tpds.2020.3046870
IF: 5.3
2021-07-01
IEEE Transactions on Parallel and Distributed Systems
Abstract: Utilizing heterogeneous accelerators, especially GPUs, to accelerate machine learning tasks has proven to be a great success in recent years. GPUs bring huge performance improvements to machine learning and greatly promote its widespread adoption. However, the discrete CPU-GPU architecture design, with its high PCIe transmission overhead, decreases the GPU computing benefits in machine learning training tasks. To overcome such limitations, hardware vendors have released CPU-GPU integrated architectures with shared unified memory. In this article, we design a benchmark suite for machine learning training on CPU-GPU integrated architectures, called iMLBench, covering a wide range of machine learning applications and kernels. We mainly explore two features of integrated architectures: 1) zero-copy, which means that the PCIe overhead is eliminated for machine learning tasks, and 2) co-running, which means that the CPU and the GPU run together to process a single machine learning task. Our experimental results on iMLBench show that the integrated architecture brings an average 7.1× performance improvement over the original implementations. Specifically, the zero-copy design brings a 4.65× performance improvement, and co-running brings a 1.78× improvement. Moreover, integrated architectures exhibit promising results from both performance-per-dollar and energy perspectives, achieving a 6.50× performance-price ratio and 4.06× energy efficiency over discrete GPUs. The benchmark is open-sourced at https://github.com/ChenyangZhang-cs/iMLBench.
Computer Science, Theory & Methods; Engineering, Electrical & Electronic
What problem does this paper attempt to address?
This paper addresses the lack of a benchmark suite for evaluating machine learning training on CPU-GPU integrated architectures. In the traditional discrete CPU-GPU design, data must cross the PCIe bus, and the high transfer overhead reduces the benefit of GPU computing in machine learning training tasks. To overcome this limitation, hardware vendors have released CPU-GPU integrated architectures with shared unified memory, which eliminates PCIe transfers and lets the CPU and the GPU work on a single machine learning task at the same time. However, no existing machine learning benchmark suite specifically targets integrated architectures, so their performance on these tasks is hard to evaluate. The paper proposes a benchmark suite named iMLBench to fill this gap.

The design of iMLBench takes into account two main characteristics of the integrated architecture:

1. **Zero-Copy**: With shared unified memory, the PCIe transfer overhead is eliminated for machine learning tasks.
2. **Co-Running**: The CPU and the GPU run together to process a single machine learning task.

The main contributions of the paper include:

- Analyzing the existing benchmark suites and pointing out that none of them is a machine learning benchmark suite designed for the integrated architecture.
- Proposing iMLBench, the first machine learning benchmark suite for the integrated architecture, covering eight common machine learning tasks.
- Evaluating iMLBench on the integrated architecture and comparing it with the discrete architecture, demonstrating the advantages of the integrated architecture in terms of performance, cost, and energy efficiency.

Through these efforts, iMLBench not only helps to better exploit the characteristics of the integrated architecture, but also provides a useful reference for future hardware design.
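
To make the zero-copy and co-running ideas more concrete, the following is a minimal Python sketch and not code from iMLBench. It only simulates the pattern on the CPU: two threads stand in for the CPU and the GPU, both read one shared buffer in place through `memoryview` slices (which do not copy the elements), and a `cpu_share` ratio decides how the work is split. The names `split_range`, `sum_of_squares`, `co_run`, and `cpu_share` are hypothetical and chosen for illustration. Because of Python's global interpreter lock, the sketch shows the partitioning logic only and says nothing about real speedups.

```python
from array import array
from concurrent.futures import ThreadPoolExecutor


def split_range(n_items, cpu_share):
    """Return (cpu_range, gpu_range) as half-open index ranges over n_items."""
    if not 0.0 <= cpu_share <= 1.0:
        raise ValueError("cpu_share must be in [0, 1]")
    cut = int(n_items * cpu_share)
    return (0, cut), (cut, n_items)


def sum_of_squares(view):
    """Stand-in kernel: reduce one partition of the shared buffer."""
    return sum(x * x for x in view)


def co_run(data, cpu_share):
    """Process one buffer with two workers that both read it in place."""
    # Slicing a memoryview creates a new view, not a copy of the elements,
    # which mimics both devices addressing the same unified memory.
    shared = memoryview(data)
    (c0, c1), (g0, g1) = split_range(len(data), cpu_share)
    with ThreadPoolExecutor(max_workers=2) as pool:
        cpu_part = pool.submit(sum_of_squares, shared[c0:c1])
        # Hypothetical "GPU" worker: a second thread in this simulation.
        gpu_part = pool.submit(sum_of_squares, shared[g0:g1])
        return cpu_part.result() + gpu_part.result()


data = array("d", range(1000))
total = co_run(data, cpu_share=0.3)
print(total)
```

In a real integrated-architecture setting, the split ratio would presumably be tuned to the relative throughput of the two devices for each workload; here it is just a fixed parameter.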