Automatic Multi-Parameter Performance Modeling of HPC Applications on a New Sunway Supercomputer
Yilian Zhang, Yao Liu, Penglong Jiao, Yiping Zhou, Tongquan Wei
DOI: https://doi.org/10.1109/tpds.2023.3317296
IF: 5.3
2023-10-04
IEEE Transactions on Parallel and Distributed Systems
Abstract: As the successor to Sunway TaihuLight, the new Sunway supercomputer has ultra-high computing capacity, but its unique heterogeneous architecture makes performance optimization challenging for High Performance Computing (HPC) applications. Performance modeling is an effective way to identify performance bottlenecks and thereby improve the performance of HPC applications. Existing performance modeling techniques do not work well on large-scale HPC applications because of their high overhead and low accuracy, and they are ill-suited to the heterogeneous architecture because they lack support for multiple resource parameters. To address these challenges, we propose an automatic multi-parameter performance modeling method for HPC applications on the new Sunway supercomputer. First, a lightweight profiling method is proposed to collect performance data with low overhead. Then, performance models with multiple resource parameters are built on the Fourier neural operator, achieving high prediction accuracy and generalization ability. Finally, the Fourier neural operator is extended on the new Sunway supercomputer to automate performance modeling. Experimental results show that the average prediction error is less than 10% and the average overhead is less than 4%, outperforming the baselines.
Computer Science, Theory & Methods; Engineering, Electrical & Electronic
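The abstract names the Fourier neural operator as the basis of the performance models but gives no implementation details. As a rough illustration only, the sketch below shows the generic building block of such an operator in NumPy: a 1D Fourier layer that transforms the input to the frequency domain, mixes channels on the lowest `modes` frequencies, transforms back, and adds a pointwise linear term. All names (`spectral_conv_1d`, `fourier_layer`, `mean_relative_error`) and the choice of ReLU are assumptions made for this example, not the authors' code, and the relative-error helper is just one common way to compute an average prediction error; the paper's exact metric is not stated in the abstract.

```python
import numpy as np


def spectral_conv_1d(x, weights, modes):
    """Channel mixing in the frequency domain on the lowest `modes` frequencies.

    x:       real array of shape (c_in, n)
    weights: complex array of shape (c_in, c_out, modes)
    returns: real array of shape (c_out, n)
    """
    n = x.shape[-1]
    x_hat = np.fft.rfft(x, axis=-1)  # shape (c_in, n // 2 + 1)
    out_hat = np.zeros((weights.shape[1], x_hat.shape[-1]), dtype=complex)
    # Multiply each retained mode by its own (c_in x c_out) weight matrix;
    # higher modes stay zero, which truncates the spectrum.
    out_hat[:, :modes] = np.einsum("im,iom->om", x_hat[:, :modes], weights)
    return np.fft.irfft(out_hat, n=n, axis=-1)


def fourier_layer(x, weights, w_local, modes):
    """One Fourier layer: spectral path plus pointwise linear path, then ReLU."""
    spectral = spectral_conv_1d(x, weights, modes)
    local = w_local @ x  # (c_out, c_in) @ (c_in, n) -> (c_out, n)
    return np.maximum(spectral + local, 0.0)


def mean_relative_error(pred, actual):
    """Average of |pred - actual| / |actual| over all samples."""
    pred = np.asarray(pred, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.mean(np.abs(pred - actual) / np.abs(actual)))
```

In a performance-modeling setting, the input channels would hypothetically hold resource parameters sampled on a grid and the output channel the predicted runtime; a threshold such as "average prediction error below 10%" would then correspond to `mean_relative_error(pred, actual) < 0.10`.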