Machine-Learning-Driven Runtime Optimization of BLAS Level 3 on Modern Multi-Core Systems

Yufan Xia, Giuseppe Maria Junior Barca

2024-06-28

Abstract: BLAS Level 3 operations are essential for scientific computing, but finding the optimal number of threads for multi-threaded implementations on modern multi-core systems is challenging. We present an extension to the Architecture and Data-Structure Aware Linear Algebra (ADSALA) library that uses machine learning to optimize the runtime of all BLAS Level 3 operations. Our method predicts the best number of threads for each operation based on the matrix dimensions and the system architecture. We test our method on two HPC platforms with Intel and AMD processors, using MKL and BLIS as baseline BLAS implementations. We achieve speedups of 1.5 to 3.0 for all operations, compared to using the maximum number of threads. We also analyze the runtime patterns of different BLAS operations and explain the sources of speedup. Our work shows the effectiveness and generality of the ADSALA approach for optimizing BLAS routines on modern multi-core systems.

Distributed, Parallel, and Cluster Computing, Machine Learning

What problem does this paper attempt to address?

This paper addresses the problem of optimizing the runtime performance of BLAS (Basic Linear Algebra Subprograms) Level 3 operations on modern multi-core systems, in particular how to find the optimal number of threads for multi-threaded implementations. Specifically:

1. **Background problems**:
   - BLAS Level 3 operations are crucial in scientific computing.
   - On modern multi-core systems, finding the number of threads that gives the best performance for these operations is challenging.
2. **Existing challenges**:
   - BLAS Level 3 optimization has been studied extensively on single-threaded CPUs, but less so on multi-core CPUs because of their higher complexity and diversity.
   - Traditional optimization methods such as parameter tuning and blocking work well in the single-threaded setting but are less effective on multi-core systems.
3. **Research objectives**:
   - Develop a machine-learning-based method to optimize the runtime of all BLAS Level 3 operations.
   - Improve performance by predicting the optimal number of threads for each operation and selecting it dynamically according to the matrix dimensions and the system architecture.
4. **Specific contributions**:
   - Propose an extension to the Architecture and Data-Structure Aware Linear Algebra (ADSALA) library that uses a machine-learning model to optimize the runtime of BLAS Level 3 operations.
   - Test the method on two high-performance computing (HPC) platforms, one with Intel and one with AMD processors, using MKL and BLIS as baseline BLAS implementations.
   - Achieve speedups of 1.5 to 3.0 compared with using the maximum number of threads.

### Summary

The main goal of this paper is to optimize the runtime performance of BLAS Level 3 operations on modern multi-core systems by using machine learning to select the optimal number of threads. The method improves performance and also shows the potential of machine learning in high-performance computing.
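The thread-selection idea summarized above can be illustrated with a minimal Python sketch. It is not the ADSALA implementation: the function names are hypothetical, and `toy_gemm_runtime` is an invented analytic stand-in for a trained runtime model, with an assumed per-thread throughput and an assumed per-thread synchronization cost. The sketch only shows the selection step, which evaluates a runtime predictor for every candidate thread count and keeps the one with the lowest predicted runtime.

```python
from typing import Callable, Tuple

# (m, k, n) for C = A @ B, where A is m x k and B is k x n.
Dims = Tuple[int, int, int]
RuntimeModel = Callable[[Dims, int], float]


def toy_gemm_runtime(dims: Dims, threads: int) -> float:
    """Hypothetical stand-in for a trained runtime predictor.

    Assumption (not from the paper): compute time scales as 1/threads,
    while synchronization overhead grows linearly with the thread count.
    """
    m, k, n = dims
    flops = 2.0 * m * k * n
    per_thread_rate = 1e9  # assumed flop/s delivered by one thread
    sync_cost = 1e-5       # assumed overhead in seconds per thread
    return flops / (per_thread_rate * threads) + sync_cost * threads


def pick_thread_count(predict: RuntimeModel, dims: Dims, max_threads: int) -> int:
    """Return the candidate thread count with the lowest predicted runtime."""
    return min(range(1, max_threads + 1), key=lambda t: predict(dims, t))


def predicted_speedup(predict: RuntimeModel, dims: Dims, max_threads: int) -> float:
    """Predicted runtime at max_threads divided by that at the chosen count."""
    best = pick_thread_count(predict, dims, max_threads)
    return predict(dims, max_threads) / predict(dims, best)
```

Under this toy model, a small multiplication is assigned far fewer threads than the machine offers, while a large one uses all of them, so the predicted speedup over the maximum-thread baseline is above 1 only when the overhead term dominates. In practice the chosen count would then be handed to the BLAS library's own threading control before the call.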

BLASX: A High Performance Level-3 BLAS Library for Heterogeneous Multi-GPU Computing

Multi-Threaded Dense Linear Algebra Libraries for Low-Power Asymmetric Multicore Processors

Design and Implementation of Adaptive SpMV Library for Multicore and Many-Core Architecture

Scheduling optimization of parallel linear algebra algorithms using Supervised Learning

Fast Matrix Multiplication via Compiler-only Layered Data Reorganization and Intrinsic Lowering

Optimization Of Triangular Matrix Functions In Blas Library On Loongson2f

Developing a BLAS library for the AMD AI Engine

Accelerated linear algebra compiler for computationally efficient numerical models: Success and potential area of improvement

FT-BLAS: A Fault Tolerant High Performance BLAS Implementation on x86 CPUs

Research on the Optimization of BLAS Level 1 and 2 Functions on Shenwei Many-Core Processor

Lasa: Abstraction and Specialization for Productive and Performant Linear Algebra on FPGAs

Performance Analysis and Optimization of Sparse Matrix-Vector Multiplication on Modern Multi- and Many-Core Processors

Optimization of BLAS based on Loongson 2F architecture

Optimizing Sparse Linear Algebra Through Automatic Format Selection and Machine Learning

Design and Implementation for Nonblocking Execution in GraphBLAS: Tradeoffs and Performance

Performance Optimization for Sparse A(T)Ax in Parallel on Multicore Cpu

Optimizing the Linear Fascicle Evaluation Algorithm for Multi-Core and Many-Core Systems

Smat: An Input Adaptive Auto-Tuner For Sparse Matrix-Vector Multiplication

Scaling Support Vector Machines on Modern HPC Platforms

Architecture-Aware Configuration and Scheduling of Matrix Multiplication on Asymmetric Multicore Processors