Abstract: Graphics processing units (GPUs) are continually evolving to meet the computational demands of contemporary general-purpose workloads, particularly those driven by artificial intelligence (AI) using deep learning techniques. A substantial body of work has been dedicated to dissecting the microarchitectural metrics that characterize diverse GPU generations, which helps researchers understand the hardware details and leverage them to optimize GPU programs. However, the latest Hopper GPUs present a set of novel attributes, including new tensor cores supporting FP8, DPX, and distributed shared memory, whose performance and operational characteristics remain poorly understood. In this research, we present an extensive benchmarking study focused on the Hopper GPU. The objective is to unveil its microarchitectural intricacies through an examination of the new instruction-set architecture (ISA) of Nvidia GPUs and the use of new CUDA APIs. Our approach involves two main aspects. First, we conduct conventional latency and throughput comparison benchmarks across the three most recent GPU architectures, namely Hopper, Ada, and Ampere. Second, we discuss and benchmark the latest Hopper features, encompassing the Hopper DPX dynamic programming (DP) instruction set, distributed shared memory, and the availability of FP8 tensor cores. The microbenchmarking results we present offer a deeper understanding of the novel GPU AI function units and programming features introduced by the Hopper architecture. This understanding is expected to greatly facilitate software optimization and modeling efforts for GPU architectures. To the best of our knowledge, this study makes the first attempt to demystify the tensor core performance and programming instruction sets unique to Hopper GPUs.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper aims to provide an in-depth analysis and benchmarking of the microarchitectural features of the latest Nvidia Hopper GPU architecture. Specifically, the paper focuses on the following aspects:

1. **Revealing the new features of the Hopper GPU**:
   - **Tensor Cores**: The Hopper GPU introduces new tensor cores that support FP8 precision, significantly enhancing the training and inference acceleration capabilities of large-scale language models (LLMs).
   - **Dynamic Programming Instruction Set (DPX)**: The DPX instruction set accelerates various dynamic programming algorithms, which usually involve a large number of min/max operations.
   - **Distributed Shared Memory (DSM)**: DSM allows direct communication between different streaming multiprocessors (SMs), including load, store, and atomic operations.
   - **Enhanced asynchronous execution mechanism**: The Hopper GPU enhances its asynchronous data transfer capability through the Tensor Memory Accelerator (TMA), improving efficiency.
2. **Performance evaluation and comparison**:
   - **Performance comparison across generations of GPUs**: The paper compares the performance of the three latest GPU architectures, Hopper, Ada, and Ampere, through conventional latency and throughput benchmark tests.
   - **Detailed analysis of tensor cores**: Starting from instruction-level testing and analysis, the paper evaluates the differences in memory architecture and computational performance of tensor cores in different generations of GPUs.
   - **Performance evaluation of the Transformer engine**: For the linear layer, Transformer layer, and large-scale language model generation tasks of the Transformer model, the paper evaluates the support and optimization effects of FP8 precision under the Hopper architecture.
3. **Evaluation of new CUDA programming features**:
   - **Performance evaluation of the DPX instruction set**: Through instruction latency and throughput tests, the paper evaluates the performance of DPX functions and determines where their hardware acceleration is located.
   - **Efficiency evaluation of asynchronous data transfer**: Through experiments, the paper evaluates the asynchronous data transfer capability of TMA in the Hopper architecture, especially in matrix multiplication applications.
   - **Performance evaluation of distributed shared memory**: Through three benchmark tests (latency measurement, ring copy, and histogram application), the paper evaluates the performance of DSM in data transfer and processing.

### Summary

The main goal of the paper is to reveal the new features of the Hopper GPU and its performance advantages in artificial intelligence applications through detailed benchmarking and microarchitecture analysis. This not only helps researchers and developers better understand and optimize GPU programs but also provides a valuable reference for future GPU architecture design.
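To make the remark about min/max-heavy dynamic programming concrete, the following is a minimal host-side sketch in Python, not code from the paper. It assumes Smith-Waterman local alignment with a linear gap penalty as the example algorithm; the helper `max3_relu` and the scoring parameters are hypothetical names chosen for illustration. The helper emulates the kind of fused "three-way max clamped at zero" step that DPX-style instructions are designed to accelerate (CUDA exposes comparable intrinsics such as `__vimax3_s32_relu`, but nothing here calls the GPU).

```python
def max3_relu(a: int, b: int, c: int) -> int:
    """Hypothetical helper: fused three-way max clamped at zero, max(a, b, c, 0)."""
    return max(a, b, c, 0)


def smith_waterman_score(s: str, t: str, match: int = 2,
                         mismatch: int = -1, gap: int = 1) -> int:
    """Best local alignment score of s and t with a linear gap penalty.

    The scoring parameters are illustrative assumptions, not values from the paper.
    """
    prev = [0] * (len(t) + 1)  # previous row of the DP table
    best = 0
    for i in range(1, len(s) + 1):
        curr = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            # Each cell is one fused max-of-three-with-zero operation,
            # which is the pattern the DP recurrence repeats for every cell.
            curr[j] = max3_relu(prev[j - 1] + sub,  # diagonal: align s[i-1] with t[j-1]
                                prev[j] - gap,      # up: gap in t
                                curr[j - 1] - gap)  # left: gap in s
            best = max(best, curr[j])
        prev = curr
    return best
```

Every cell update reduces to additions followed by a single clamped three-way maximum, so the inner loop is dominated by exactly the min/max pattern described above; on a GPU, a fused instruction would replace the several separate comparisons that `max` performs here.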

Benchmarking and Dissecting the Nvidia Hopper GPU Architecture
