An empirical study of major page faults for failure diagnosis in cluster systems

Edward Chuah,Arshad Jhumka,Sai Narasimhamurthy
DOI: https://doi.org/10.1007/s11227-023-05366-1
IF: 3.3
2023-05-15
The Journal of Supercomputing
Abstract:High-performance computing systems conduct extensive logging of resource usage data and system logs, and parsing these data is an often advocated basis for failure diagnosis. Major page faults are known to be one of the most common cause of performance problems in large cluster systems. We conduct an empirical study of major page faults on two large cluster systems. We set up three regression algorithms including the LASSO, Ridge and Elastic Net regression techniques. To the best of our knowledge, there is no work that studied different regression models to diagnose major page faults in a large cluster system. In this paper, we (a) propose an approach for diagnosing major page faults, and (b) evaluate the LASSO, Ridge and Elastic Net regression algorithms on real resource use data and system logs. As part of our contributions, we (a) compare the accuracy of the three regression algorithms, (b) identify the resource use counters which are correlated to major page faults and the system events which are correlated to page fault events, and (c) provide insights into major page faults and page fault events. Our work highlights empirical observations that could facilitate better handling of node failures in cluster systems.
computer science, theory & methods,engineering, electrical & electronic, hardware & architecture
What problem does this paper attempt to address?