Abstract: With the increasing scale and complexity of High-Performance Computing (HPC) systems, performance variations in applications caused by anomalies have become significant bottlenecks for system health and operational efficiency. As we move towards exascale systems, these variations become more prominent due to the increased sharing of resources, and they lead to lower energy efficiency and higher operational costs. To mitigate these problems, one must quickly and accurately diagnose the root cause of the anomalies at scale. One way to evaluate system health and identify the underlying causes is to manually examine selected performance metrics in telemetry data or to apply rule-based methods. However, because telemetry data can reach terabytes per day and numeric telemetry contains thousands of metrics, manual analysis becomes challenging. Given these limitations, Machine Learning (ML)-based approaches have been gaining popularity, as they have been shown to be effective and practical in diagnosing previously encountered performance anomalies. One primary challenge for supervised ML models is that they require a significant number of labeled samples during training. However, obtaining many labels for anomalies is extremely difficult and costly, since anomalies occur infrequently and real-world numeric system telemetry data, with its thousands of metrics, is hard to label. This paper proposes a novel active learning-based framework that diagnoses performance anomalies (i.e., identifies the type of an anomaly) in HPC systems at runtime using significantly fewer labeled samples than state-of-the-art ML-based approaches. We show that the proposed framework achieves the same F1-score as a supervised approach using far fewer labeled samples (16x fewer samples to reach a 0.78 F1-score and 11x fewer to reach a 0.82 F1-score), even when the test dataset contains previously unseen applications and application inputs.
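As an illustration of the general idea of active learning (not the framework proposed in the paper), the following minimal sketch shows a pool-based loop in which a model repeatedly asks an annotator to label only the sample it is least certain about. Everything here is an assumption made for illustration: the nearest-centroid classifier, the margin-based query strategy, the synthetic two-metric telemetry, and the hypothetical labels `healthy` and `memleak`.

```python
import random


def centroids(X, y):
    """Per-class mean of the labeled feature vectors."""
    sums, counts = {}, {}
    for xi, yi in zip(X, y):
        acc = sums.setdefault(yi, [0.0] * len(xi))
        for j, v in enumerate(xi):
            acc[j] += v
        counts[yi] = counts.get(yi, 0) + 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def predict(model, x):
    """Label of the nearest class centroid."""
    return min(model, key=lambda c: sq_dist(model[c], x))


def margin(model, x):
    """Gap between the two nearest centroids; a small gap means uncertain."""
    d = sorted(sq_dist(c, x) for c in model.values())
    return d[1] - d[0] if len(d) > 1 else 0.0


def active_learning(X_pool, oracle, seed_idx, budget):
    """Query the oracle for `budget` labels, most uncertain sample first."""
    labeled = {i: oracle(i) for i in seed_idx}
    for _ in range(budget):
        model = centroids([X_pool[i] for i in labeled], list(labeled.values()))
        candidates = [i for i in range(len(X_pool)) if i not in labeled]
        if not candidates:
            break
        query = min(candidates, key=lambda i: margin(model, X_pool[i]))
        labeled[query] = oracle(query)  # the only step that costs a label
    model = centroids([X_pool[i] for i in labeled], list(labeled.values()))
    return model, labeled


# Synthetic stand-in for telemetry: two metrics per sample, two classes.
rng = random.Random(0)
X_pool = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(100)]
X_pool += [[rng.gauss(4, 1), rng.gauss(4, 1)] for _ in range(100)]
true_labels = ["healthy"] * 100 + ["memleak"] * 100  # hypothetical labels

# Start from one labeled sample per class, then spend a budget of 10 queries.
model, labeled = active_learning(
    X_pool, oracle=lambda i: true_labels[i], seed_idx=[0, 100], budget=10
)
accuracy = sum(
    predict(model, x) == t for x, t in zip(X_pool, true_labels)
) / len(X_pool)
```

In this sketch the annotator is asked for 12 labels in total (2 seeds plus 10 queries) out of a pool of 200 samples. The label savings reported in the abstract come from the paper's own framework and datasets, not from this toy setup.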

Runtime Performance Anomaly Diagnosis in Production HPC Systems Using Active Learning
