Abstract: Utilizing GPUs is critical for high performance on heterogeneous systems. However, exploiting the full potential of GPUs to accelerate legacy CPU applications is difficult for developers. The porting process requires identifying code regions amenable to acceleration, managing distinct memories, synchronizing host and device execution, and handling library functions that may not be directly executable on the device. This complexity makes it hard for non-experts to use GPUs effectively, or even to start offloading parts of a large legacy application. In this paper, we propose a novel compilation scheme called "GPU First" that automatically compiles legacy CPU applications directly for GPUs without any modification of the application source. Library calls inside the application are resolved either through our partial libc GPU implementation or via automatically generated remote procedure calls to the host. Our approach simplifies the task of identifying code regions amenable to acceleration and enables rapid testing of code modifications on actual GPU hardware in order to guide porting efforts. Our evaluation covers two HPC proxy applications with OpenMP CPU and GPU parallelism, four microbenchmarks with originally GPU-only parallelism, and three benchmarks from the SPEC OMP 2012 suite featuring hand-optimized OpenMP CPU parallelism, and it showcases the simplicity of porting host applications to the GPU. For existing parallel loops, we often match the performance of the corresponding manually offloaded kernels, with up to 14.36x speedup on the GPU, validating that our GPU First methodology can effectively guide porting efforts of large legacy applications.
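
To make the manual porting burden described in the abstract concrete, the following is a minimal sketch of a legacy OpenMP CPU loop next to a hand-offloaded variant of the same loop. It is not taken from the paper and does not show the GPU First compilation scheme itself; the function names `saxpy_cpu` and `saxpy_offload` are hypothetical and chosen only for illustration. The sketch assumes an OpenMP-capable C compiler; without offloading support, the `target` region simply runs on the host.

```c
#include <assert.h>
#include <stddef.h>

/* Legacy CPU version (hypothetical example): y = a*x + y,
 * parallelized across host threads with OpenMP. */
void saxpy_cpu(size_t n, float a, const float *x, float *y) {
    assert(x != NULL && y != NULL);
    #pragma omp parallel for
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Manually offloaded version (hypothetical example): the developer has to
 * mark the region for the device and spell out the data movement between
 * host and device memory with map clauses. */
void saxpy_offload(size_t n, float a, const float *x, float *y) {
    assert(x != NULL && y != NULL);
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Both functions compute the same result; the second only differs in the directive and the explicit `map` clauses. This is the kind of per-loop annotation work that, according to the abstract, the proposed scheme aims to avoid by compiling the unmodified application for the GPU.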

GPU First -- Execution of Legacy CPU Codes on GPUs

Performance evaluation and analysis of sparse matrix and graph kernels on heterogeneous processors