Characterize and Optimize Dense Linear Solver on Multi-core CPUs

Xiao Fu, Xing Su, Dezun Dong, Weiling Yang
DOI: https://doi.org/10.1109/icpads60453.2023.00253
2023-01-01
Abstract: The dense linear solver is an essential subroutine in high-performance computing. Typical parallel implementations adopt either the fork-join or the task-parallel programming model. Blocked algorithms built on the fork-join paradigm focus on optimizing cache locality but leave significant synchronization overhead. Tile-based algorithms built on the task-parallel paradigm follow a data-driven execution model, which effectively relieves this synchronization overhead and yields superior load balancing. Nevertheless, they introduce redundant memory accesses that slow down CPU execution. In this paper, we first characterize and quantify the impact of these performance bottlenecks in depth and then propose a series of optimizations. Specifically, we reduce thread idle time by merging LU factorization with the subsequent lower triangular solve to improve parallelism. Moreover, we eliminate the tile-based matrix format transformation and reduce duplicated data packing operations to lower memory access overhead. Performance is evaluated on two modern multi-core systems, Intel(R) Xeon(R) Gold 6252N and HiSilicon Kunpeng 920. The results show that the proposed solver outperforms state-of-the-art open-source implementations, with performance gains of up to 11.5% and 12.2% on the respective platforms.
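The idea of merging LU factorization with the subsequent lower triangular solve can be illustrated, at the purely algorithmic level, with a minimal sketch. This is not the authors' tile-based, multi-threaded implementation: it is an unblocked, single-threaded Python version without pivoting, and the function name `lu_solve_fused` is hypothetical. It only shows that the forward-substitution update of the right-hand side can be applied in the same sweep that eliminates each column, rather than as a separate phase after the factorization completes.

```python
def lu_solve_fused(a, b):
    """Solve a @ x = b with LU factorization and forward substitution fused.

    Assumptions (illustrative only): a is a square list of lists, no
    pivoting is performed, so every pivot must be nonzero (for example,
    a diagonally dominant matrix).
    """
    n = len(a)
    lu = [row[:] for row in a]   # working copy; ends up holding L (below diagonal) and U
    y = list(b)                  # right-hand side, updated in the same sweep

    for k in range(n):
        pivot = lu[k][k]
        if pivot == 0:
            raise ZeroDivisionError("zero pivot; this sketch does no pivoting")
        for i in range(k + 1, n):
            lu[i][k] /= pivot            # multiplier l_ik, stored in place
            m = lu[i][k]
            for j in range(k + 1, n):
                lu[i][j] -= m * lu[k][j] # trailing-matrix update
            y[i] -= m * y[k]             # fused step: forward substitution for L y = b

    # Back substitution with the upper triangular factor: U x = y
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = y[i]
        for j in range(i + 1, n):
            s -= lu[i][j] * x[j]
        x[i] = s / lu[i][i]
    return x


if __name__ == "__main__":
    a = [[4.0, 1.0, 0.0],
         [1.0, 4.0, 1.0],
         [0.0, 1.0, 4.0]]
    b = [6.0, 12.0, 14.0]
    print(lu_solve_fused(a, b))
```

In a parallel solver the same fusion would presumably let work on the right-hand side proceed alongside the factorization instead of waiting for it to finish, which is the source of the reduced idle time described above; the sketch does not model threads or tiles.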