Abstract: Modern Graphics Processing Units (GPUs) provide sufficiently flexible programming models that understanding their performance can provide insight into designing tomorrow's manycore processors, whether those are GPUs or otherwise. The combination of multiple, multithreaded, SIMD cores makes studying these GPUs useful in understanding tradeoffs among memory-, data-, and thread-level parallelism. While modern GPUs offer orders of magnitude more raw computing power than contemporary CPUs, many important applications, even those with abundant data-level parallelism, do not achieve peak performance. This paper characterizes several non-graphics applications written in NVIDIA's CUDA programming model by running them on a novel detailed microarchitecture performance simulator that runs NVIDIA's parallel thread execution (PTX) virtual instruction set. For this study, we selected twelve non-trivial CUDA applications demonstrating varying levels of performance improvement on GPU hardware (versus a CPU-only sequential version of the application). We study the performance of these applications on our GPU performance simulator with configurations comparable to contemporary high-end graphics cards. We characterize the performance impact of several microarchitecture design choices, including choice of interconnect topology, use of caches, design of memory controller, parallel workload distribution mechanisms, and memory request coalescing hardware. Two observations we make are (1) that, for the applications we study, performance is more sensitive to interconnect bisection bandwidth than to latency, and (2) that, for some applications, running fewer threads concurrently than on-chip resources might otherwise allow can improve performance by reducing contention in the memory system.

To Co-run, or Not to Co-run: A Performance Study on Integrated Architectures

Understanding Co-Running Behaviors on Integrated CPU/GPU Architectures

Characterizing the Performance of Emerging Deep Learning, Graph, and High Performance Computing Workloads Under Interference

iMLBench: A Machine Learning Benchmark Suite for CPU-GPU Integrated Architectures

FinePar: Irregularity-aware Fine-Grained Workload Partitioning on Integrated Architectures

A Hybrid Reconfigurable Architecture and Design Methods Aiming at Control-Intensive Kernels

A Study of Performance Programming of CPU, GPU accelerated Computers and SIMD Architecture

Fine-Grained Multi-Query Stream Processing on Integrated Architectures

Orchestrated Co-scheduling, Resource Partitioning, and Power Capping on CPU-GPU Heterogeneous Systems via Machine Learning

Analyzing CUDA workloads using a detailed GPU simulator

Revisiting Co-Processing for Hash Joins on the Coupled CPU-GPU Architecture

High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs

Exploiting the Task-Pipelined Parallelism of Stream Programs on Many-Core GPUs

Implementing Performance Portability of High Performance Computing Programs in the New Golden Age of Chip Architecture

Characterizing the Execution Dynamics of GPGPU Applications

Exploiting co-execution with oneAPI: heterogeneity from a modern perspective

Exploiting Parallelism in the Simulation of General Purpose Graphics Processing Unit Program

Design Space Exploration of Embedded SoC Architectures for Real-Time Optimal Control

HW/SW Co-Optimization for Stencil Computation: Beginning with a Customizable Core