Abstract: Modern Graphics Processing Units (GPUs) provide sufficiently flexible programming models that understanding their performance can provide insight into designing tomorrow's manycore processors, whether those are GPUs or otherwise. The combination of multiple, multithreaded, SIMD cores makes studying these GPUs useful in understanding tradeoffs among memory, data, and thread level parallelism. While modern GPUs offer orders of magnitude more raw computing power than contemporary CPUs, many important applications, even those with abundant data level parallelism, do not achieve peak performance. This paper characterizes several non-graphics applications written in NVIDIA's CUDA programming model by running them on a novel detailed microarchitecture performance simulator that runs NVIDIA's parallel thread execution (PTX) virtual instruction set. For this study, we selected twelve non-trivial CUDA applications demonstrating varying levels of performance improvement on GPU hardware (versus a CPU-only sequential version of the application). We study the performance of these applications on our GPU performance simulator with configurations comparable to contemporary high-end graphics cards. We characterize the performance impact of several microarchitecture design choices, including choice of interconnect topology, use of caches, design of memory controller, parallel workload distribution mechanisms, and memory request coalescing hardware. Two observations we make are (1) that, for the applications we study, performance is more sensitive to interconnect bisection bandwidth than to latency, and (2) that, for some applications, running fewer threads concurrently than on-chip resources might otherwise allow can improve performance by reducing contention in the memory system.

CuPBoP: CUDA for Parallelized and Broad-range Processors

CuPBoP: Making CUDA a Portable Language

High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs

MapCG: Writing Parallel Program Portable Between CPU and GPU

Supporting CUDA for an extended RISC-V GPU architecture

Performance-Portable Many-Core Plasma Simulations: Porting PIConGPU to OpenPower and Beyond

Providing Source Code Level Portability Between CPU and GPU with MapCG

BabelTower: Learning to Auto-parallelized Program Translation

GPT-Driven Source-to-Source Transformation for Generating Compilable Parallel CUDA Code for Nussinov's Algorithm

CUDA-Zero: a Framework for Porting Shared Memory GPU Applications to Multi-GPUs

GPU First -- Execution of Legacy CPU Codes on GPUs

Taking GPU Programming Models to Task for Performance Portability

A Metaprogramming and Autotuning Framework for Deploying Deep Learning Applications

Impact of CUDA and OpenCL on Parallel and Distributed Computing

POPA: Expressing High and Portable Performance Across Spatial and Vector Architectures for Tensor Computations

Boosting CUDA Applications with CPU–GPU Hybrid Computing

Programming Framework for Node Heterogeneous GPU Cluster

HeteroPP: A directive‐based heterogeneous cooperative parallel programming framework

Analyzing CUDA workloads using a detailed GPU simulator

Understanding Co-Running Behaviors on Integrated CPU/GPU Architectures