Revisiting the performance optimization of QR factorization on Intel KNL and SKL multiprocessors
Muhammad Rizwan, Enoch Jung, Jongsun Choi, Jaeyoung Choi
DOI: https://doi.org/10.1007/s11227-024-06002-2
IF: 3.3
2024-03-14
The Journal of Supercomputing
Abstract: This study focuses on optimizing the double-precision general matrix–matrix multiplication (DGEMM) routine to improve QR factorization performance. After replacing the MKL DGEMM with our previously developed blocked matrix–matrix multiplication routine, we found that QR factorization performance was suboptimal because of a bottleneck in the matrix–panel multiplication operation. We investigate the limitations of our matrix–matrix multiplication routine and find that its performance depends on the shape and size of the matrices. We therefore recommend different kernels tailored to the matrix shapes that arise in QR factorization, and we develop a new routine for the matrix–panel multiplication operation. We evaluate the proposed kernels in the ScaLAPACK QR factorization routine by comparing them with the MKL, OpenBLAS, and BLIS libraries. The proposed optimization yields significant performance improvements in multinode cluster environments based on the Intel Xeon Phi Processor 7250, codenamed Knights Landing (KNL), and the Intel Xeon Gold 6148 Scalable Skylake Processor (SKL).
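To illustrate why matrix shape matters here, the following minimal sketch lists the matrix-product dimensions that arise in the trailing-matrix update of a blocked Householder QR factorization. It is a simplified model, not the authors' routine: the function name `trailing_update_gemm_shapes`, the block width `nb`, and the two-product view of the update (W = C^T V, then C = C - V W^T) are assumptions made for illustration, and the small triangular factor and any ScaLAPACK data distribution are ignored.

```python
def trailing_update_gemm_shapes(m, n, nb):
    """List the (M, N, K) dimensions of the two products in each trailing update.

    Simplified model (an assumption, not the paper's code): after a panel of
    width jb is factored into a reflector block V, the trailing matrix C is
    updated as W = C^T V followed by C = C - V W^T.
    """
    shapes = []
    k_total = min(m, n)
    for j in range(0, k_total, nb):
        jb = min(nb, k_total - j)   # width of the current panel
        rows = m - j                # rows of V and of the trailing matrix C
        cols = n - j - jb           # columns of the trailing matrix C
        if cols <= 0:
            break                   # last panel: nothing left to update
        gemm_w = (cols, jb, rows)   # W = C^T V: result is cols x jb, inner dim rows
        gemm_c = (rows, cols, jb)   # C -= V W^T: result is rows x cols, inner dim jb
        shapes.append((j, gemm_w, gemm_c))
    return shapes


if __name__ == "__main__":
    for j, gemm_w, gemm_c in trailing_update_gemm_shapes(1000, 1000, 100):
        print(j, gemm_w, gemm_c)
```

In this model every product has one dimension fixed at the block width while the others shrink as the factorization proceeds, so the operands are tall or wide rather than square. This is consistent with the abstract's observation that a kernel tuned for one matrix shape may underperform on the shapes that occur in QR factorization.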
computer science, theory & methods,engineering, electrical & electronic, hardware & architecture