Abstract: Utilizing GPUs is critical for high performance on heterogeneous systems. However, leveraging the full potential of GPUs to accelerate legacy CPU applications can be challenging for developers. The porting process requires identifying code regions amenable to acceleration, managing distinct memories, synchronizing host and device execution, and handling library functions that may not be directly executable on the device. This complexity makes it difficult for non-experts to use GPUs effectively, or even to start offloading parts of a large legacy application. In this paper, we propose a novel compilation scheme called "GPU First" that automatically compiles legacy CPU applications directly for GPUs without any modification of the application source. Library calls inside the application are resolved either through our partial libc GPU implementation or via automatically generated remote procedure calls to the host. Our approach simplifies the task of identifying code regions amenable to acceleration and enables rapid testing of code modifications on actual GPU hardware in order to guide porting efforts. Our evaluation covers two HPC proxy applications with OpenMP CPU and GPU parallelism, four microbenchmarks originally written with GPU-only parallelism, and three benchmarks from the SPEC OMP 2012 suite featuring hand-optimized OpenMP CPU parallelism; it showcases the simplicity of porting host applications to the GPU. For existing parallel loops, we often match the performance of corresponding manually offloaded kernels, with up to 14.36x speedup on the GPU, validating that our GPU First methodology can effectively guide porting efforts of large legacy applications.
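To make the contrast between an existing CPU-parallel loop and a "manually offloaded kernel" concrete, the following is a minimal sketch in C with OpenMP. It is not taken from the paper and does not show the GPU First compiler itself; the function names `saxpy_cpu` and `saxpy_offload` and the SAXPY loop are hypothetical illustrations. The second function shows the kind of hand-written target region and data mapping that a manual port typically requires.

```c
#include <assert.h>
#include <stddef.h>

/* Legacy CPU version (hypothetical example): an OpenMP-parallel SAXPY
   loop as it might appear in an existing host application. */
void saxpy_cpu(size_t n, float a, const float *x, float *y) {
    assert(x != NULL && y != NULL);
    #pragma omp parallel for
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Manually offloaded version (hypothetical example): the developer adds
   a target region and explicit map clauses that copy x to the device
   and copy y to the device and back. */
void saxpy_offload(size_t n, float a, const float *x, float *y) {
    assert(x != NULL && y != NULL);
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Both functions compute the same result. When built without OpenMP support the pragmas are ignored and the loops run sequentially, and when built with OpenMP but without an available device the target region is expected to fall back to the host, so the sketch can be checked on an ordinary CPU.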

A Translation Framework for Virtual Execution Environment on CPU/GPU Architecture

A Translation Framework for Executing the Sequential Binary Code on CPU/GPU Based Architectures

A compiler framework for translating standard C into optimized CUDA code

High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs

MapCG: Writing Parallel Program Portable Between CPU and GPU

Implementation Framework for Binary Stream Pattern Extraction under CPU/GPU

A Dynamic Binary Translation Framework Based on Page Fault Mechanism in Linux Kernel

A Compiler Translate Directive-Based Language to Optimized CUDA

Novel automatic mapping technology on CPU-GPU heterogeneous systems

GXBIT: Combining polyhedral model with dynamic binary translation

Two-phase Execution of Binary Applications on CPU/GPU Machines

BabelTower: Learning to Auto-parallelized Program Translation

A Programming Framework Based on Multi-GPU

GPU First -- Execution of Legacy CPU Codes on GPUs

GPU-S2S: A Compiler for Source-to-Source Translation on GPU

Providing Source Code Level Portability Between CPU and GPU with MapCG

Exploiting Parallelism in the Simulation of General Purpose Graphics Processing Unit Program

UKCF: A New Graphics Driver Cross-Platform Translation Framework for Virtual Machines

A Polyhedral Modeling Based Source-to-Source Code Optimization Framework for GPGPU

High-Throughput Sequence Translation Using CUDA

swCUDA: Auto parallel code translation framework from CUDA to ATHREAD for new generation sunway supercomputer