Abstract: Utilizing heterogeneous accelerators, especially GPUs, to accelerate machine learning tasks has proven to be a great success in recent years. GPUs bring huge performance improvements to machine learning and have greatly promoted its widespread adoption. However, the discrete CPU-GPU architecture design, with its high PCIe transmission overhead, decreases the GPU computing benefits in machine learning training tasks. To overcome this limitation, hardware vendors have released CPU-GPU integrated architectures with shared unified memory. In this article, we design a benchmark suite for machine learning training on CPU-GPU integrated architectures, called iMLBench, covering a wide range of machine learning applications and kernels. We mainly explore two features of integrated architectures: 1) zero-copy, which means that the PCIe overhead is eliminated for machine learning tasks, and 2) co-running, which means that the CPU and the GPU run together to process a single machine learning task. Our experimental results on iMLBench show that the integrated architecture brings an average 7.1× performance improvement over the original implementations. Specifically, the zero-copy design brings a 4.65× performance improvement, and co-running brings a 1.78× improvement. Moreover, integrated architectures exhibit promising results from both performance-per-dollar and energy perspectives, achieving a 6.50× performance-price ratio and 4.06× energy efficiency over discrete GPUs. The benchmark is open-sourced at https://github.com/ChenyangZhang-cs/iMLBench.

What problem does this paper attempt to address?

This paper addresses the lack of a machine learning benchmark suite designed for CPU-GPU integrated architectures. In the traditional discrete CPU-GPU design, data must cross the PCIe bus, and this transfer overhead reduces the benefit of GPU computation in machine learning training tasks. In response, hardware manufacturers have introduced CPU-GPU integrated architectures with unified shared memory, which eliminates the PCIe transfer overhead and lets the CPU and GPU work on a single machine learning task at the same time. However, no existing machine learning benchmark suite targets these integrated architectures specifically, so their performance on machine learning tasks is hard to evaluate. The paper proposes a benchmark suite named iMLBench to fill this gap.

The design of iMLBench takes into account two main characteristics of the integrated architecture:

1. **Zero-Copy**: Eliminating the PCIe transfer overhead improves the data-transfer efficiency of machine learning tasks.
2. **Co-Running**: The CPU and GPU work together to jointly handle a single machine learning task.

The main contributions of the paper include:

- Analyzing the existing benchmark suites and pointing out that none of them is a machine learning benchmark suite designed specifically for the integrated architecture.
- Proposing iMLBench, the first machine learning benchmark suite for the integrated architecture, covering eight common machine learning tasks.
- Evaluating iMLBench on the integrated architecture and comparing it with the discrete architecture, demonstrating the advantages of the integrated architecture in terms of performance, cost, and energy efficiency.

Through these efforts, iMLBench not only helps to make better use of the characteristics of the integrated architecture, but also provides a valuable reference for future hardware design.
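To make the two features concrete, the following is a minimal timing sketch, not code from iMLBench. It rests on several simplifying assumptions: the `Workload` fields, the function names, and the example numbers are all hypothetical; the same GPU kernel time is used for the discrete and integrated cases (real integrated GPUs are usually slower than discrete ones); and the work is assumed to be perfectly divisible, with no memory contention between the CPU and the GPU.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of one training kernel (illustrative only)."""
    bytes_moved: float   # input + output data that would cross PCIe, in bytes
    cpu_seconds: float   # time if the CPU alone processes all the work
    gpu_seconds: float   # kernel time if the GPU alone processes all the work


def discrete_time(w: Workload, pcie_bytes_per_s: float) -> float:
    """GPU-only time on a discrete card: kernel time plus PCIe copy time."""
    return w.gpu_seconds + w.bytes_moved / pcie_bytes_per_s


def zero_copy_time(w: Workload) -> float:
    """GPU-only time with shared memory: the copy term disappears."""
    return w.gpu_seconds


def co_run_cpu_fraction(w: Workload) -> float:
    """Fraction of the work given to the CPU so both devices finish together.

    The CPU takes f * cpu_seconds and the GPU takes (1 - f) * gpu_seconds;
    setting the two equal gives f = gpu_seconds / (cpu_seconds + gpu_seconds).
    """
    return w.gpu_seconds / (w.cpu_seconds + w.gpu_seconds)


def co_run_time(w: Workload) -> float:
    """Elapsed time when CPU and GPU split the work at the balanced fraction."""
    return co_run_cpu_fraction(w) * w.cpu_seconds


if __name__ == "__main__":
    # Made-up numbers chosen only to keep the arithmetic easy to follow.
    w = Workload(bytes_moved=8e9, cpu_seconds=3.0, gpu_seconds=1.0)
    print("discrete GPU :", discrete_time(w, pcie_bytes_per_s=16e9))
    print("zero-copy    :", zero_copy_time(w))
    print("CPU fraction :", co_run_cpu_fraction(w))
    print("co-running   :", co_run_time(w))
```

In this toy model, zero-copy helps most when the copy term is large relative to the kernel time, and co-running helps most when the CPU and GPU have comparable throughput. The speedups reported in the paper come from measurements on real hardware, not from a model of this kind.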

iMLBench: A Machine Learning Benchmark Suite for CPU-GPU Integrated Architectures

DaDianNao: A Machine-Learning Supercomputer

Understanding Co-Running Behaviors on Integrated CPU/GPU Architectures

BENCHIP: Benchmarking Intelligence Processors

Benchmarking Edge AI Platforms for High-Performance ML Inference

Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms

To Co-run, or Not to Co-run: A Performance Study on Integrated Architectures

Evaluating Emerging AI/ML Accelerators: IPU, RDU, and NVIDIA/AMD GPUs

Optimizing the LINPACK Algorithm for Large-Scale PCIe-Based CPU-GPU Heterogeneous Systems

Exploiting Intel Advanced Matrix Extensions (AMX) for Large Language Model Inference

Orchestrated Co-scheduling, Resource Partitioning, and Power Capping on CPU-GPU Heterogeneous Systems via Machine Learning

Introducing Milabench: Benchmarking Accelerators for AI

PuDianNao: A Polyvalent Machine Learning Accelerator

A Unified Optimization Approach for CNN Model Inference on Integrated GPUs

Understanding the GPU Microarchitecture to Achieve Bare-Metal Performance Tuning

AIBench Training: Balanced Industry-Standard AI Training Benchmarking

An Experimental Evaluation of Machine Learning Training on a Real Processing-in-Memory System

Revisiting Linpack Algorithm on Large-scale CPU-GPU Heterogeneous Systems

A Ubiquitous Machine Learning Accelerator With Automatic Parallelization on FPGA