Runtime Performance Anomaly Diagnosis in Production HPC Systems Using Active Learning
Burak Aksar, Efe Sencan, Benjamin Schwaller, Omar Aaziz, Vitus J. Leung, Jim Brandt, Brian Kulis, Manuel Egele, Ayse K. Coskun
DOI: https://doi.org/10.1109/tpds.2024.3365462
IF: 5.3
2024-03-09
IEEE Transactions on Parallel and Distributed Systems
Abstract: With the increasing scale and complexity of High-Performance Computing (HPC) systems, performance variations in applications caused by anomalies have become significant bottlenecks in system health and operational efficiency. As we move towards exascale systems, these variations become more prominent due to the increased sharing of resources. Such variations lead to lower energy efficiency and higher operational costs. To mitigate these problems, one must quickly and accurately diagnose the root cause of the anomalies at scale. One way to evaluate system health and identify the underlying causes is to manually examine certain performance metrics in telemetry data or to use rule-based methods. Because telemetry data can reach terabytes per day and contains thousands of numeric metrics, manual analysis of telemetry to diagnose problems becomes challenging. Given these limitations, Machine Learning (ML)-based approaches have been gaining popularity as they have been shown to be effective and practical in diagnosing previously encountered performance anomalies. One primary challenge for supervised ML models is that they require a significant number of labeled samples during training. However, obtaining many labels for anomalies is extremely difficult and costly, considering anomalies occur infrequently and real-world numeric system telemetry data is hard to label since it contains thousands of metrics. This paper proposes a novel active learning-based framework that diagnoses performance anomalies (i.e., identifies the type of an anomaly) in HPC systems at runtime using significantly fewer labeled samples than state-of-the-art ML-based approaches.
We show that the proposed framework achieves the same F1-score as a supervised approach while using far fewer labeled samples (i.e., 16x fewer samples to reach a 0.78 F1-score and 11x fewer samples to reach a 0.82 F1-score), even when there are previously unseen applications and application inputs in the test dataset.
computer science, theory & methods; engineering, electrical & electronic
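
The abstract describes diagnosing anomaly types with an active learning loop that requests labels for only a small number of samples. The abstract does not specify the query strategy, classifier, or features, so the following is only a minimal generic sketch of pool-based active learning, not the authors' framework. Everything in it is an assumption made for illustration: the synthetic 2-D "telemetry features", the hypothetical class names in `CENTERS`, the nearest-centroid classifier, and the margin-based uncertainty query.

```python
import math
import random

# Hypothetical classes and cluster centers; stand-ins for real telemetry features.
CENTERS = {"healthy": (0.0, 0.0), "anomaly_a": (6.0, 0.0), "anomaly_b": (0.0, 6.0)}

def make_pool(rng, n_per_class=60):
    """Generate synthetic (features, label) samples around each class center."""
    pool = [((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)), label)
            for label, (cx, cy) in CENTERS.items() for _ in range(n_per_class)]
    rng.shuffle(pool)
    return pool

def fit_centroids(labeled):
    """Nearest-centroid 'training': average the labeled points of each class."""
    sums = {}
    for (x, y), label in labeled:
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, n + 1)
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}

def ranked(centroids, point):
    """Return (distance, label) pairs sorted from nearest to farthest centroid."""
    return sorted((math.dist(point, c), label) for label, c in centroids.items())

def predict(centroids, point):
    return ranked(centroids, point)[0][1]

def margin(centroids, point):
    """Gap between the two nearest centroids; a small gap means an uncertain sample."""
    d = ranked(centroids, point)
    return d[1][0] - d[0][0]

def active_learning(pool, budget):
    """Seed with one labeled sample per class, then query `budget` more labels."""
    labeled, unlabeled, seen = [], [], set()
    for sample in pool:
        if sample[1] not in seen:
            seen.add(sample[1])
            labeled.append(sample)
        else:
            unlabeled.append(sample)
    for _ in range(budget):
        centroids = fit_centroids(labeled)
        # Only features are inspected here; the stored label plays the role of
        # the human annotator and is used only once the sample is queried.
        idx = min(range(len(unlabeled)),
                  key=lambda i: margin(centroids, unlabeled[i][0]))
        labeled.append(unlabeled.pop(idx))
    return fit_centroids(labeled), labeled

def accuracy(centroids, samples):
    hits = sum(predict(centroids, point) == label for point, label in samples)
    return hits / len(samples)

if __name__ == "__main__":
    rng = random.Random(0)
    pool, test_set = make_pool(rng), make_pool(rng)
    centroids, labeled = active_learning(pool, budget=12)
    print(f"labels used: {len(labeled)}, test accuracy: {accuracy(centroids, test_set):.2f}")
```

The point of the design is that the annotator is consulted only for the samples the current model is least sure about, which is how an active learner can approach a fully supervised model's score with a fraction of the labels. A real system would replace the synthetic pool with features extracted from telemetry and the stored labels with an actual annotator.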