Abstract: The rising demand for high-performance computing (HPC) has made full-chip dynamic thermal simulation in many-core GPUs critical for optimizing performance and extending device lifespans. Proper orthogonal decomposition (POD) with Galerkin projection (GP) has been shown to offer high accuracy and massive runtime improvements over direct numerical simulation (DNS). However, previous implementations of POD-GP use MPI-based libraries like PETSc and FEniCS and face significant runtime bottlenecks. We propose a $\textbf{Py}$Torch-based $\textbf{POD-GP}$ library (PyPOD-GP), a GPU-optimized library for chip-level thermal simulation. PyPOD-GP achieves over $23.4\times$ speedup in training and over $10\times$ speedup in inference on a GPU with over 13,000 cores, with just $1.2\%$ error over the device layer.

What problem does this paper attempt to address?

The problem this paper addresses is **full-chip dynamic thermal simulation of many-core GPUs in high-performance computing (HPC)**. As HPC demand grows, efficient dynamic thermal simulation tools are needed to optimize performance and extend device lifespan. Traditional direct numerical simulation (DNS) is accurate but computationally expensive, while alternative methods improve efficiency at the cost of accuracy or resolution.

### Specific description of the problem:

1. **Thermal management challenges brought by high-density processors**:
   - Modern chip designs significantly increase processor power density, producing large temperature gradients and hot spots that reduce performance and reliability.
   - Dynamic thermal management systems can alleviate these problems, but they still depend on efficient, accurate thermal simulation tools.
2. **Limitations of existing methods**:
   - **Direct numerical simulation (DNS)**: It provides an accurate temperature solution, but its computational cost is extremely high because of the large number of degrees of freedom (DoF).
   - **Other alternative methods**: They improve efficiency but sacrifice accuracy or resolution.
3. **Advantages and bottlenecks of the POD-GP method**:
   - The **POD-GP method**, which combines proper orthogonal decomposition (POD) and Galerkin projection (GP), runs far faster than DNS while maintaining high accuracy.
   - However, previous POD-GP implementations rely on MPI-based libraries (such as PETSc and FEniCS) and face significant runtime bottlenecks during training and inference, especially when applied to GPUs with a large number of cores.

### Solution:

To solve the above problems, the authors propose **PyPOD-GP**, a PyTorch-based, GPU-optimized library for chip-level thermal simulation. By leveraging PyTorch's tensor operations, PyPOD-GP runs substantially faster than CPU-based implementations. Specifically:

- **Training speed improvement**: On an NVIDIA Tesla Volta GV100 GPU, PyPOD-GP achieves a training speedup of more than 23.4 times.
- **Inference speed improvement**: On the same hardware, PyPOD-GP achieves an inference speedup of more than 10 times.
- **High accuracy**: Over the device layer, the error of PyPOD-GP is only 1.2%, demonstrating its potential for large-scale GPU architectures.

### Summary:

This paper aims to provide an efficient and accurate GPU-accelerated thermal simulation tool, the PyPOD-GP library, to meet the dynamic thermal management requirements of many-core GPUs in high-performance computing. This not only improves the efficiency of thermal simulation but also makes real-time thermal monitoring and multi-device prediction possible.
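The POD-GP idea summarized above can be illustrated with a minimal PyTorch sketch on a toy problem. This is not the PyPOD-GP API: the 1D rod model, the heat-source shape, the SVD-based mode extraction, and every name here (`conduction_matrix`, `backward_euler`, `phi`, `G_r`) are assumptions chosen for illustration. The sketch collects temperature snapshots from a full-order solve ("training"), extracts POD modes from them, projects the governing equation onto those modes (Galerkin projection), and then solves the much smaller reduced system ("inference").

```python
import math
import torch

dtype = torch.float64
device = "cuda" if torch.cuda.is_available() else "cpu"

def conduction_matrix(n, kappa=1.0, loss=1.0):
    """Finite-difference operator G for dT/dt + G T = P on a 1D rod (toy model)."""
    dx = 1.0 / (n + 1)
    main = torch.full((n,), 2.0 * kappa / dx**2 + loss, dtype=dtype, device=device)
    off = torch.full((n - 1,), -kappa / dx**2, dtype=dtype, device=device)
    return torch.diag(main) + torch.diag(off, 1) + torch.diag(off, -1)

def backward_euler(G, power, dt, steps):
    """Integrate da/dt + G a = power(t) from a zero state; returns (dim, steps)."""
    dim = G.shape[0]
    lhs = torch.eye(dim, dtype=dtype, device=device) + dt * G
    state = torch.zeros(dim, dtype=dtype, device=device)
    history = []
    for step in range(1, steps + 1):
        state = torch.linalg.solve(lhs, state + dt * power(step * dt))
        history.append(state)
    return torch.stack(history, dim=1)

# Full-order model: n unknowns, time-varying localized heat source (hypothetical).
n, k, dt, steps = 200, 8, 1e-3, 400
x = torch.linspace(0.0, 1.0, n + 2, dtype=dtype, device=device)[1:-1]
G = conduction_matrix(n)
source_profile = torch.exp(-((x - 0.3) / 0.05) ** 2)

def power(t):
    return 100.0 * (1.0 + math.sin(20.0 * t)) * source_profile

# "Training": collect snapshots and keep the k leading left singular vectors.
T_full = backward_euler(G, power, dt, steps)
U, S, _ = torch.linalg.svd(T_full, full_matrices=False)
phi = U[:, :k]  # orthonormal POD modes, shape (n, k)

# Galerkin projection: the reduced operator is only k x k.
G_r = phi.T @ G @ phi

# "Inference": solve the reduced system, then lift back to the full grid.
T_rom = phi @ backward_euler(G_r, lambda t: phi.T @ power(t), dt, steps)

rel_err = (torch.linalg.norm(T_rom - T_full) / torch.linalg.norm(T_full)).item()
print(f"reduced {n} -> {k} unknowns, relative error = {rel_err:.2e}")
```

The reduced solve works with a `k x k` system instead of an `n x n` one, which is where the runtime savings of POD-GP come from. Expressing both stages as dense tensor operations is what allows the same code to run on a GPU when one is available.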

PyPOD-GP: Using PyTorch for Accelerated Chip-Level Thermal Simulation of the GPU

PODTherm-GP: A Physics-based Data-Driven Approach for Effective Architecture-Level Thermal Simulation of Multi-Core CPUs

Predicting Accurate Hot Spots in a More Than Ten-Thousand-Core GPU with a Million-Time Speedup over FEM Enabled by a Physics-based Learning Algorithm

Heterogeneous Programming and Optimization of Gyrokinetic Toroidal Code and Large-Scale Performance Test on TH-1A.

The Implementation of the Three-Dimensional Unified Gas-Kinetic Wave-Particle Method on Multiple Graphics Processing Units

GPU_PBTE: an Efficient Solver for Three and Four Phonon Scattering Rates on Graphics Processing Units

Accelerating atmospheric physics parameterizations using graphics processing units

Accelerating Pythonic coupled cluster implementations: a comparison between CPUs and GPUs

High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs

GPU coprocessors as a service for deep learning inference in high energy physics

GPU-HADVPPM4HIP V1.0: using the heterogeneous-compute interface for portability (HIP) to speed up the piecewise parabolic method in the CAMx (v6.10) air quality model on China's domestic GPU-like accelerator

High Performance Computing Via a GPU

GPU acceleration of an iterative scheme for gas-kinetic model equations with memory reduction techniques

Application of performance portability solutions for GPUs and many-core CPUs to track reconstruction kernels

GPU-HADVPPM V1.0: a high-efficiency parallel GPU design of the piecewise parabolic method (PPM) for horizontal advection in an air quality model (CAMx V6.10)

Study of A Gpu-Based Parallel Computing Method for the Monte Carlo Program

Evaluation of Portable Acceleration Solutions for LArTPC Simulation Using Wire-Cell Toolkit

GPU Domain Specialization via Composable On-Package Architecture

Particle-in-Cell Code for GPU Systems