Abstract: Utilizing GPUs is critical for high performance on heterogeneous systems. However, leveraging the full potential of GPUs for accelerating legacy CPU applications can be a challenging task for developers. The porting process requires identifying code regions amenable to acceleration, managing distinct memories, synchronizing host and device execution, and handling library functions that may not be directly executable on the device. This complexity makes it challenging for non-experts to leverage GPUs effectively, or even to start offloading parts of a large legacy application. In this paper, we propose a novel compilation scheme called "GPU First" that automatically compiles legacy CPU applications directly for GPUs without any modification of the application source. Library calls inside the application are either resolved through our partial libc GPU implementation or via automatically generated remote procedure calls to the host. Our approach simplifies the task of identifying code regions amenable to acceleration and enables rapid testing of code modifications on actual GPU hardware in order to guide porting efforts. Our evaluation on two HPC proxy applications with OpenMP CPU and GPU parallelism, four micro benchmarks with originally GPU only parallelism, as well as three benchmarks from the SPEC OMP 2012 suite featuring hand-optimized OpenMP CPU parallelism showcases the simplicity of porting host applications to the GPU. For existing parallel loops, we often match the performance of corresponding manually offloaded kernels, with up to 14.36x speedup on the GPU, validating that our GPU First methodology can effectively guide porting efforts of large legacy applications.

What problem does this paper attempt to address?

The paper addresses the difficulty of porting legacy CPU applications to GPUs, in particular applications that were not designed with GPU acceleration in mind. It tackles four issues:

1. **Automatic Compilation Scheme**: The paper proposes a compilation scheme called "GPU First" that compiles legacy CPU applications directly for the GPU without modifying the application source. This lowers the barrier for developers, especially non-experts.
2. **Library Function Call Handling**: A ported program may call functions that exist only in system or third-party libraries and cannot run on the device. Such calls are resolved either through a partial libc implementation for the GPU or through automatically generated Remote Procedure Calls (RPCs), which let GPU code invoke these functions on the host as if they were local. The compiler identifies and replaces the affected calls and ensures that arguments and return values are transferred correctly between the GPU and the host.
3. **Multi-Team Execution and Kernel Splitting**: To improve GPU utilization, the paper describes a technique that identifies parallel regions suitable for execution by multiple teams (thread blocks) and converts them into multi-team kernels. This distributes the workload across the device and improves execution efficiency.
4. **Memory Allocation and Tracking**: The paper proposes a custom heap allocator for memory allocation on the GPU. The allocator can be tuned to the application scenario and tracks allocated memory regions, so that information about an object can be determined at runtime when it is not known at compile time.

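
The host RPC idea in item 2 can be illustrated with a small simulation. This is a minimal sketch, not the paper's implementation: it runs entirely on the CPU, the names `HOST_FUNCTIONS`, `host_serve`, and `make_stub` are hypothetical, and JSON encoding stands in for whatever transfer mechanism moves arguments and results between device and host.

```python
import json
import math

# Hypothetical table of functions that are assumed to exist only on the host.
HOST_FUNCTIONS = {
    "hypot": math.hypot,
    "upper": str.upper,
}


def host_serve(request: bytes) -> bytes:
    """Host side: decode a request, run the real function, encode the result."""
    message = json.loads(request.decode("utf-8"))
    function = HOST_FUNCTIONS[message["name"]]
    result = function(*message["args"])
    return json.dumps({"result": result}).encode("utf-8")


def make_stub(name):
    """Device side: build a drop-in replacement that forwards calls to the host."""

    def stub(*args):
        # Marshal the call into a flat byte buffer, as a real RPC would.
        request = json.dumps({"name": name, "args": list(args)}).encode("utf-8")
        # In this simulation the "transfer" is just a direct function call.
        reply = host_serve(request)
        return json.loads(reply.decode("utf-8"))["result"]

    return stub


# The caller uses the stubs exactly like the original library functions.
remote_hypot = make_stub("hypot")
remote_upper = make_stub("upper")
```

Calling `remote_hypot(3.0, 4.0)` goes through the same marshal, serve, and unmarshal path as any other stub, which is the property that lets a compiler swap library calls for generated stubs without touching the call sites.
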
In summary, this research aims to reduce the difficulty of migrating traditional CPU applications to GPUs through automated compilation techniques and innovative execution strategies, thereby enabling more developers to leverage the powerful computing capabilities of GPUs.
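
The allocation tracking described in item 4 can be sketched in the same spirit. The example below is an illustration under stated assumptions rather than the paper's allocator: `TrackingAllocator` is a hypothetical bump allocator over an integer address range, and `lookup` shows how recorded regions let a runtime recover the base and size of the object that an arbitrary pointer falls into.

```python
import bisect


class TrackingAllocator:
    """Hypothetical bump allocator that remembers every region it hands out."""

    def __init__(self, capacity, alignment=16):
        self.capacity = capacity
        self.alignment = alignment
        self.next_free = 0
        self.bases = []  # base addresses, sorted because allocation only bumps upward
        self.sizes = []  # sizes[i] is the size of the region starting at bases[i]

    def malloc(self, size):
        """Return the base address of a new region, or None if it does not fit."""
        if size <= 0:
            return None
        base = self.next_free
        end = base + size
        if end > self.capacity:
            return None
        self.bases.append(base)
        self.sizes.append(size)
        # Round the next free address up to the alignment boundary.
        self.next_free = -(-end // self.alignment) * self.alignment
        return base

    def lookup(self, address):
        """Return (base, size) of the region containing address, or None."""
        index = bisect.bisect_right(self.bases, address) - 1
        if index < 0:
            return None
        base, size = self.bases[index], self.sizes[index]
        if address < base + size:
            return (base, size)
        return None  # address lies in alignment padding or unallocated space


heap = TrackingAllocator(capacity=1024)
first = heap.malloc(100)
second = heap.malloc(40)
```

With such a table, a pointer into the middle of an allocation, for example `heap.lookup(second + 8)`, resolves to the whole object, which is the kind of information needed to decide how many bytes belong to a pointer argument.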

GPU First -- Execution of Legacy CPU Codes on GPUs
