Abstract: Architecture designers increasingly integrate CPUs and GPUs on the same chip to deliver energy-efficient designs, yet how to effectively exploit the strengths of both devices on such integrated architectures remains an open problem. In this work, we port 42 programs from the Rodinia, Parboil, and Polybench benchmark suites and analyze their co-running behavior on both AMD and Intel integrated architectures. We find that co-running does not always outperform running a program on the CPU alone or the GPU alone. Of the 42 programs, only eight benefit from co-running, while 24 achieve their best performance on the GPU alone and seven on the CPU alone. The remaining three show little performance preference among the devices. Through extensive workload characterization, we find that the architectural differences between CPUs and GPUs and the limited shared memory bandwidth are the two main factors affecting current co-running performance. Since not all programs benefit from integrated architectures, we build an automatic decision-tree-based model to help application developers predict the co-running performance of a given CPU-only or GPU-only program. Results show that our model makes the correct prediction for 14 of the 15 evaluated programs. For a co-run-friendly program, we further propose a profiling-based method to predict the optimal workload partition ratio between the CPU and the GPU. Results show that the predicted partition achieves 87.7 percent of the performance of the best partition. The co-running programs obtained with our method outperform the original CPU-only and GPU-only programs by 34.5 and 20.9 percent, respectively.
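To make the idea of a workload partition ratio concrete, here is a minimal sketch of a profiling-based split. It is not the method proposed in the paper: it assumes that each device's execution time scales linearly with its share of the work and that the two devices do not slow each other down, which the abstract indicates is optimistic given the limited shared memory bandwidth. The function names `split_ratio` and `corun_time` and the example timings are hypothetical.

```python
def split_ratio(cpu_time: float, gpu_time: float) -> float:
    """Return the fraction of the work to assign to the GPU.

    cpu_time and gpu_time are profiled times for running the WHOLE
    workload on one device alone. Assumption (not from the paper):
    time scales linearly with the share of work and there is no
    contention, so the ideal split makes both devices finish together.
    """
    if cpu_time <= 0 or gpu_time <= 0:
        raise ValueError("profiled times must be positive")
    cpu_rate = 1.0 / cpu_time  # workloads per unit time on the CPU
    gpu_rate = 1.0 / gpu_time  # workloads per unit time on the GPU
    return gpu_rate / (cpu_rate + gpu_rate)


def corun_time(cpu_time: float, gpu_time: float, gpu_fraction: float) -> float:
    """Estimated co-run time under the same idealized linear model.

    The co-run finishes when the slower of the two devices finishes.
    """
    if not 0.0 <= gpu_fraction <= 1.0:
        raise ValueError("gpu_fraction must lie in [0, 1]")
    return max((1.0 - gpu_fraction) * cpu_time, gpu_fraction * gpu_time)


if __name__ == "__main__":
    # Hypothetical profile: 3.0 s on the CPU alone, 1.0 s on the GPU alone.
    ratio = split_ratio(3.0, 1.0)
    print(f"GPU share: {ratio:.2f}")
    print(f"Estimated co-run time: {corun_time(3.0, 1.0, ratio):.2f} s")
```

Under this idealized model, co-running always looks at least as fast as the faster device alone. The abstract reports the opposite for most of the 42 programs, so a real predictor has to account for contention and architectural mismatch; treat the value returned here as a starting point for profiling, not as a prediction.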

Understanding Data Partition for Applications on CPU-GPU Integrated Processors

Data Partitioning Strategy of GPU Heterogeneous Clusters Based on Learning

FinePar: Irregularity-aware Fine-Grained Workload Partitioning on Integrated Architectures

A CPU-GPGPU Scheduler Based on Data Transmission Bandwidth of Workload

Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps

Multithread Content Based File Chunking System in CPU-GPGPU Heterogeneous Architecture

Understanding Data Movement in AMD Multi-GPU Systems with Infinity Fabric

Performance evaluation and analysis of sparse matrix and graph kernels on heterogeneous processors

A perceptual and predictive batch-processing memory scheduling strategy for a CPU-GPU heterogeneous system

Analyzing Memory Access on CPU-GPGPU Shared LLC Architecture

Revisiting Co-Processing for Hash Joins on the Coupled CPU-GPU Architecture

Experience of Parallelizing Cryo-EM 3D Reconstruction on a CPU-GPU Heterogeneous System

A Hybrid Sorting Algorithm on Heterogeneous Architectures

Orchestrated Co-scheduling, Resource Partitioning, and Power Capping on CPU-GPU Heterogeneous Systems via Machine Learning

Locality-aware Thread Block Design in Single and Multi-GPU Graph Processing

Understanding Co-Running Behaviors on Integrated CPU/GPU Architectures

Process variation-aware workload partitioning algorithms for GPUs supporting spatial-multitasking

A Simple Yet Effective Balanced Edge Partition Model for Parallel Computing

Efficient Resource Sharing Through GPU Virtualization on Accelerated High Performance Computing Systems

Optimizing the LINPACK Algorithm for Large-Scale PCIe-Based CPU-GPU Heterogeneous Systems