Machine-Learning-Driven Runtime Optimization of BLAS Level 3 on Modern Multi-Core Systems

Yufan Xia, Giuseppe Maria Junior Barca
2024-06-28
Abstract: BLAS Level 3 operations are essential for scientific computing, but finding the optimal number of threads for multi-threaded implementations on modern multi-core systems is challenging. We present an extension to the Architecture and Data-Structure Aware Linear Algebra (ADSALA) library that uses machine learning to optimize the runtime of all BLAS Level 3 operations. Our method predicts the best number of threads for each operation based on the matrix dimensions and the system architecture. We test our method on two HPC platforms with Intel and AMD processors, using MKL and BLIS as baseline BLAS implementations. We achieve speedups of 1.5 to 3.0 for all operations, compared to using the maximum number of threads. We also analyze the runtime patterns of different BLAS operations and explain the sources of speedup. Our work shows the effectiveness and generality of the ADSALA approach for optimizing BLAS routines on modern multi-core systems.
Distributed, Parallel, and Cluster Computing; Machine Learning
### What problem does this paper attempt to solve?

This paper addresses the runtime optimization of BLAS Level 3 (Basic Linear Algebra Subprograms Level 3) operations on modern multi-core systems, in particular how to find the optimal number of threads for multi-threaded implementations. Specifically:

1. **Background problems**:
   - BLAS Level 3 operations are crucial in scientific computing.
   - On modern multi-core systems, finding the number of threads that gives the best performance for these operations is challenging.
2. **Existing challenges**:
   - BLAS Level 3 optimization has been studied extensively on single-threaded CPUs, but less on multi-core CPUs because of their higher complexity and diversity.
   - Traditional optimization methods such as parameter tuning and blocking are effective in general, but their benefit is limited on multi-core systems.
3. **Research objectives**:
   - Develop a machine-learning-based method to optimize the runtime of all BLAS Level 3 operations.
   - Improve performance by predicting the optimal number of threads for each operation and selecting it dynamically according to the matrix dimensions and the system architecture.
4. **Specific contributions**:
   - Propose an extension to the Architecture and Data-Structure Aware Linear Algebra (ADSALA) library that uses a machine-learning model to optimize the runtime of BLAS Level 3 operations.
   - Test the method on two high-performance computing (HPC) platforms, one with Intel and one with AMD processors, using MKL and BLIS as baseline BLAS implementations.
   - Achieve speedups of 1.5 to 3.0 compared to using the maximum number of threads.

### Summary

The main goal of this paper is to optimize the runtime of BLAS Level 3 operations on modern multi-core systems by using machine learning to select the optimal number of threads. The method improves performance and also shows the potential of machine learning in high-performance computing.
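The thread-selection idea summarized above can be illustrated with a minimal Python sketch. It is not the ADSALA implementation or its API: the function names are hypothetical, and `predicted_gemm_runtime` is a toy analytic stand-in for a trained model, with assumed constants for per-thread throughput and per-thread overhead. The sketch only shows the general pattern of evaluating a runtime predictor for each candidate thread count and picking the minimum.

```python
from typing import Callable, Tuple

# Signature of a runtime predictor: (m, n, k, threads) -> seconds.
RuntimeModel = Callable[[int, int, int, int], float]


def predicted_gemm_runtime(m: int, n: int, k: int, threads: int) -> float:
    """Toy stand-in for a trained runtime model (an assumption, not ADSALA's model).

    Compute time shrinks as threads are added, while a per-thread
    overhead term grows, so small matrices prefer fewer threads.
    """
    flops = 2.0 * m * n * k                # floating-point operations in a GEMM
    compute = flops / (1e9 * threads)      # assumed 1 GFLOP/s per thread
    overhead = 5e-6 * threads              # assumed synchronization cost per thread
    return compute + overhead


def select_thread_count(
    m: int, n: int, k: int, max_threads: int,
    model: RuntimeModel = predicted_gemm_runtime,
) -> Tuple[int, float]:
    """Return (best thread count, predicted runtime) over 1..max_threads."""
    best = min(range(1, max_threads + 1), key=lambda t: model(m, n, k, t))
    return best, model(m, n, k, best)


def predicted_speedup(
    m: int, n: int, k: int, max_threads: int,
    model: RuntimeModel = predicted_gemm_runtime,
) -> float:
    """Predicted runtime at max_threads divided by runtime at the selected count."""
    _, best_runtime = select_thread_count(m, n, k, max_threads, model)
    return model(m, n, k, max_threads) / best_runtime


if __name__ == "__main__":
    for size in (64, 4096):
        threads, runtime = select_thread_count(size, size, size, max_threads=48)
        speedup = predicted_speedup(size, size, size, max_threads=48)
        print(f"n={size}: threads={threads}, runtime={runtime:.3e}s, speedup={speedup:.2f}")
```

Under these assumed constants, a small matrix is predicted to run fastest with fewer threads than the maximum, while a large matrix uses all of them. In practice the predictor would be a model trained on measured timings for the target platform, and the selected count would be passed to the BLAS library's own thread-control mechanism.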