Abstract: BLAS is a fundamental building block of advanced linear algebra libraries and many modern scientific computing applications. GPUs are known for their strong arithmetic computing capabilities and are highly suited for BLAS operations. However, porting code to GPUs often requires significant effort, especially for large, complex codes or legacy codes, even for BLAS-heavy applications. While various tools exist to automatically offload BLAS to GPUs, they are often impractical due to the high costs associated with mandatory data transfers. The advent of unified memory architectures in recent GPU designs, such as the NVIDIA Grace-Hopper, allows cache-coherent memory access across all types of memory for both CPU and GPU, potentially eliminating the bottlenecks faced in conventional architectures. This breakthrough paves the way for innovative application developments and porting strategies. Building on our preliminary work demonstrating the potential of automatic *gemm offload, this paper extends the framework to all level-3 BLAS operations and introduces SCILIB-Accel, a novel tool for automatic BLAS offload. SCILIB-Accel leverages the memory coherency in Grace-Hopper and introduces a Device First-Use data movement policy inspired by the OpenMP First-Touch approach in multi-socket CPU programming, minimizing CPU-GPU data transfers for typical scientific computing codes. Additionally, utilizing dynamic binary instrumentation, the tool intercepts BLAS symbols directly from a CPU binary, requiring no code modifications or recompilation. SCILIB-Accel has been evaluated using multiple quantum physics codes on up to a few hundred GPU nodes, yielding promising speedups. Notably, for the LSMS method in the MuST suite, a 3x speedup was achieved on Grace-Hopper compared to Grace-Grace.

### What problem does this paper attempt to address?

This paper targets the performance bottlenecks and porting effort that arise when BLAS operations are offloaded to GPUs automatically. It focuses on three points:

1. **Data transfer overhead in conventional architectures**:
   - In a conventional CPU-GPU architecture, frequent data transfers (especially for small and medium-sized matrix operations) add significant overhead, which makes automatic BLAS offload impractical.
   - Various tools can already offload BLAS to GPUs automatically, but they are often impractical because of the high cost of mandatory data transfers.
2. **Advantages and challenges of unified memory architectures**:
   - Unified memory architectures in recent GPU designs, such as the NVIDIA Grace-Hopper, allow cache-coherent memory access for both CPU and GPU, which can potentially eliminate the bottlenecks of conventional architectures.
   - Even with unified memory, data locality remains crucial, and data movement patterns still have to be optimized to maximize performance.
3. **The need for automated, high-performance BLAS offload**:
   - Scientific computing applications (for example in quantum chemistry and physics) rely heavily on BLAS, and modern GPUs have strong arithmetic capabilities that suit BLAS operations well.
   - Porting large, complex, or legacy codes to GPUs usually requires significant effort, even for BLAS-heavy applications, so an efficient, automated BLAS offload tool is valuable.

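The data transfer overhead described above can be made concrete with a back-of-envelope model. The sketch below is not taken from the paper: it assumes a double-precision square GEMM, assumes each of the three matrices crosses the CPU-GPU link exactly once, and uses placeholder hardware figures (25 GB/s link bandwidth, 10 TFLOP/s sustained compute) that are illustrative only. All function names are hypothetical.

```python
def gemm_flops(n: int) -> int:
    """Floating-point operations for an n x n x n matrix multiply."""
    return 2 * n**3


def gemm_bytes(n: int, itemsize: int = 8) -> int:
    """Bytes moved if A, B and C each cross the link once (assumption)."""
    return 3 * n * n * itemsize


def transfer_time(n: int, link_bw: float) -> float:
    """Seconds spent moving data at link_bw bytes per second."""
    return gemm_bytes(n) / link_bw


def compute_time(n: int, flops_rate: float) -> float:
    """Seconds spent computing at flops_rate FLOP per second."""
    return gemm_flops(n) / flops_rate


def break_even_n(flops_rate: float, link_bw: float, itemsize: int = 8) -> float:
    """Matrix size at which transfer time equals compute time.

    Solving 3 * itemsize * n^2 / link_bw = 2 * n^3 / flops_rate for n.
    """
    return 3 * itemsize * flops_rate / (2 * link_bw)


# Placeholder figures, chosen only for illustration.
LINK_BW = 25e9       # bytes per second
FLOPS_RATE = 10e12   # FLOP per second

for n in (500, 1000, 5000, 10000):
    t_x = transfer_time(n, LINK_BW)
    t_c = compute_time(n, FLOPS_RATE)
    print(f"n={n:6d}  transfer={t_x:.2e}s  compute={t_c:.2e}s")
```

Under these assumed figures, transfer time exceeds compute time for matrices smaller than roughly `break_even_n(FLOPS_RATE, LINK_BW)` per side, which is why copying operands on every call penalizes small and medium-sized operations most.
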
### Solutions

To address these problems, the authors propose a new tool named SCILIB-Accel, which contributes the following:

1. **A data movement policy in the style of OpenMP First-Touch**:
   - A policy called "Device First-Use" is introduced, inspired by the OpenMP First-Touch approach in multi-socket CPU programming. It minimizes CPU-GPU data transfers by moving data to GPU memory the first time a GPU kernel uses it.
   - The policy suits typical scientific computing codes because they tend to reuse intermediate matrices, so data that has been moved once does not need to be transferred again.
2. **Dynamic binary instrumentation**:
   - Dynamic binary instrumentation (DBI) is used to intercept BLAS symbols directly from a CPU binary and redirect the calls to a GPU BLAS library.
   - As a result, SCILIB-Accel offloads BLAS without any code modification or recompilation.
3. **Performance evaluation**:
   - SCILIB-Accel was evaluated with multiple quantum physics codes on up to a few hundred GPU nodes and yielded promising speedups.
   - Notably, for the LSMS method in the MuST suite, a 3x speedup was achieved on Grace-Hopper compared to Grace-Grace.

In conclusion, the paper addresses the data transfer bottleneck of automatic BLAS offload on conventional architectures by combining a new data movement policy with binary-level interception, and it exploits the memory coherency of Grace-Hopper to give scientific computing applications GPU acceleration without porting effort.
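The effect of the Device First-Use policy can be illustrated with a toy accounting model. This is a minimal sketch, not SCILIB-Accel's implementation: it only counts bytes moved under two policies, it tracks residency per whole matrix by name, and every class and function name here is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Matrix:
    name: str
    nbytes: int


@dataclass
class CopyEveryCall:
    """Conventional offload: operands are copied around every BLAS call."""
    bytes_moved: int = 0

    def gemm(self, a: Matrix, b: Matrix, c: Matrix) -> None:
        # Host to device for A, B, C, then device to host for the result C.
        self.bytes_moved += a.nbytes + b.nbytes + c.nbytes
        self.bytes_moved += c.nbytes


@dataclass
class DeviceFirstUse:
    """Toy first-use policy: a matrix is moved once, on its first GPU use."""
    bytes_moved: int = 0
    resident: set = field(default_factory=set)

    def gemm(self, a: Matrix, b: Matrix, c: Matrix) -> None:
        for m in (a, b, c):
            if m.name not in self.resident:
                # First use by a GPU call: move the data and remember it.
                self.resident.add(m.name)
                self.bytes_moved += m.nbytes
        # Later calls find the data already resident and move nothing.


def run(policy, calls):
    """Apply a sequence of (A, B, C) gemm calls and return bytes moved."""
    for a, b, c in calls:
        policy.gemm(a, b, c)
    return policy.bytes_moved


# 100 gemm calls that reuse the same three 1000 x 1000 double matrices.
n = 1000
size = n * n * 8
A, B, C = Matrix("A", size), Matrix("B", size), Matrix("C", size)
calls = [(A, B, C)] * 100

naive = run(CopyEveryCall(), calls)
first_use = run(DeviceFirstUse(), calls)
print(f"copy every call: {naive} bytes, first use: {first_use} bytes")
```

The gap between the two counts grows with the number of calls that reuse the same matrices, which matches the observation that codes reusing intermediate matrices benefit most. The model ignores CPU-side accesses after migration, which on a cache-coherent system would go through the coherent link rather than trigger a copy.
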

Performant Automatic BLAS Offloading on Unified Memory Architecture with OpenMP First-Touch Style Data Movement

Automatic BLAS Offloading on Unified Memory Architecture: A Study on NVIDIA Grace-Hopper

GPU First -- Execution of Legacy CPU Codes on GPUs

BLASX: A High Performance Level-3 BLAS Library for Heterogeneous Multi-GPU Computing

Harnessing Integrated CPU-GPU System Memory for HPC: a first look into Grace Hopper

Developing a BLAS library for the AMD AI Engine

Portability and Scalability of OpenMP Offloading on State-of-the-art Accelerators

FT-BLAS: A Fault Tolerant High Performance BLAS Implementation on x86 CPUs

A graphics processing unit accelerated sparse direct solver and preconditioner with block low rank compression

GPU Implementation of a Sophisticated Implicit Low-Order Finite Element Solver with FP21-32-64 Computation Using OpenACC

Implementing implicit OpenMP data sharing on GPUs

Explicit caching HYB: a new high-performance SpMV framework on GPGPU

OpenACC offloading of the MFC compressible multiphase flow solver on AMD and NVIDIA GPUs

Hybrid programming-model strategies for GPU offloading of electronic structure calculation kernels

Multi-GPU Performance Optimization of a CFD Code using OpenACC on Different Platforms

Machine-Learning-Driven Runtime Optimization of BLAS Level 3 on Modern Multi-Core Systems

Multi-GPU thermal lattice Boltzmann simulations using OpenACC and MPI

Unified schemes for directive-based GPU offloading

Accelerating an Iterative Eigensolver for Nuclear Structure Configuration Interaction Calculations on GPUs Using OpenACC

Performance Portability of Sparse Block Diagonal Matrix Multiple Vector Multiplications on GPUs

Particle-resolved thermal lattice Boltzmann simulation using OpenACC on multi-GPUs