Leveraging Code Snippets to Detect Variations in the Performance of HPC Systems
Jidong Zhai, Liyan Zheng, Jinghan Sun, Feng Zhang, Xiongchao Tang, Xuehai Qian, Bingsheng He, Wei Xue, Wenguang Chen, Weiming Zheng
DOI: https://doi.org/10.1109/tpds.2022.3158742
IF: 5.3
2022-01-01
IEEE Transactions on Parallel and Distributed Systems
Abstract: Performance variation in parallel and distributed systems is an increasingly challenging problem. The runtimes of different executions can vary greatly even with a fixed number of computing nodes, and many HPC applications on supercomputers exhibit such variance. This not only leads to unpredictable execution times, but also renders the system's behavior unintuitive. Efficient online detection of performance variation is an open problem in HPC research. To address it, we propose an approach, called vSensor, to detect variations in system performance. The key finding of this study is that the source code of programs can better represent performance at runtime than an external detector. Specifically, many HPC applications contain code snippets with fixed workload patterns of execution, e.g., an invariant workload or a linearly growing workload. This observation allows us to automatically identify these workload-related code snippets and use them to detect variations in performance. We evaluate vSensor on the Tianhe-2A system with a large number of parallel applications, and the results indicate that it can efficiently identify variations in system performance. The average overhead with 4,096 processes is less than 6% for fixed-workload v-sensors. Using vSensor, we identify a problematic node with slow memory that degrades the performance of the program by 21%. We also detect a serious network performance issue that slows down an HPC kernel on the Tianhe-2A system by 3.37 times.
Computer Science, Theory & Methods; Engineering, Electrical & Electronic
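
The abstract's central idea, that a snippet with a fixed workload should take the same time on every run and so any slowdown points at the system, can be illustrated with a minimal Python sketch. This is not the vSensor implementation, which the abstract does not describe in detail. The function names (`time_snippet`, `detect_variations`, `fixed_workload`), the choice of the fastest run as the baseline, and the 1.2x threshold are all assumptions made for illustration.

```python
import time
from typing import Callable, List, Tuple


def time_snippet(snippet: Callable[[], None], repeats: int) -> List[float]:
    """Run a fixed-workload snippet `repeats` times; return each wall-clock duration."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        snippet()
        durations.append(time.perf_counter() - start)
    return durations


def detect_variations(
    durations: List[float], threshold: float = 1.2
) -> List[Tuple[int, float]]:
    """Return (run index, slowdown) for runs slower than `threshold` x the fastest run.

    Assumption: the workload is identical on every run, so the fastest
    observation is taken as the reference for normal performance.
    """
    if not durations:
        return []
    baseline = min(durations)
    if baseline <= 0:
        raise ValueError("durations must be positive")
    flagged = []
    for index, duration in enumerate(durations):
        slowdown = duration / baseline
        if slowdown > threshold:
            flagged.append((index, slowdown))
    return flagged


def fixed_workload() -> None:
    """Hypothetical snippet whose amount of work never changes between calls."""
    total = 0
    for i in range(100_000):
        total += i * i


if __name__ == "__main__":
    runs = time_snippet(fixed_workload, repeats=20)
    for index, slowdown in detect_variations(runs):
        print(f"run {index}: {slowdown:.2f}x slower than the fastest run")
```

Separating the timing from the detection keeps the detection logic deterministic, so it can be checked against recorded durations without rerunning the workload. A real detector would also need to handle the linearly growing workloads the abstract mentions, for example by normalizing each duration by the workload size before comparing.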