Abstract:Diagnosis and self-healing anomalies are an important as pectin the operations for data centers and for computing in utility clouds. Data centers which offer greater size along with heterogeneous and complex applications, software and workload systems such as anomaly/fault detection require automatic operation with runtime self-healing. Moreover, they must do so without requirement of having previous knowledge regarding normal and/or anomalous performance. Diagnosis must function for levels of different abstraction such as hardware and software, along with multiple time-series metrics for the use in environments like cloud computing. By classifying and analyzing metrics that are qualitative can offer new methods for anomaly and fault diagnosis for their distribution instead of simply providing individual metric thresholds. Categorical counterparts (binning) and normal distribution are used as a measurement that captures the extent of the distributions for concentration or dispersion so that raw metric data can be gathered across the cloud stack to create a time-series for entropy (e.g. symptoms).These symptoms can be organized hierarchically and across many cloud-ecosystems to increase scalability. These symptoms can also address the uncertainty in anomaly data. This contribution is possible since the multi-value decision diagram (MDD) for the structure of Multiple-Valued Logic (MVL) is shown to be highly efficient in terms of rapid analytical performance and is adapted to integrate the effects of measurements for multiple values. Naive Bayes Classifier (NBC) with an influence diagram (ID) are used to create time-series diagnosis technique to categorize and detect anomalies/faults during runtime to approximate the influence of each self-healing system component as to systems functioning and reliability. For the utility cloud service scenarios, the results of experiment show the viability of the presented approach. With an average of 0.89% improvement in the accuracy of anomaly diagnosis on threshold methods and an average improvement of 0.04% for false alarm rates with the near-optimum threshold method the presented approach outperforms other methods.

Diagnosing Application-network Anomalies for Millions of IPs in Production Clouds.

RAIN: Towards Real-Time Core Devices Anomaly Detection Through Session Data in Cloud Network

Root Cause Analysis of Anomalies of Multitier Services in Public Clouds.

A survey of cloud network fault diagnostic systems and tools

Proactive Telemetry in Large-Scale Multi-Tenant Cloud Overlay Networks

Intelligent Anomaly Detection and Mitigation in Data Centers

FlowPinpoint: Localizing Anomalies in Cloud-client Services for Cloud Providers

Zero Overhead Monitoring for Cloud-native Infrastructure Using RDMA

AAsclepius: Monitoring, Diagnosing, and Detouring at the Internet Peering Edge

Towards Automatic Root Cause Diagnosis of Persistent Packet Loss in Cloud Overlay Network

Utility Cloud: A Novel Approach for Diagnosis and Self-healing Based on the Uncertainty in Anomalous Metrics

Non-Intrusive Anomaly Detection with Streaming Performance Metrics and Logs for DevOps in Public Clouds: A Case Study in AWS

Achelous: Enabling Programmability, Elasticity, and Reliability in Hyperscale Cloud Networks.

Zero+: Monitoring Large-Scale Cloud-Native Infrastructure Using One-Sided RDMA

Network Anomaly Detection and Localization

A Real-Time Trace-Level Root-Cause Diagnosis System in Alibaba Datacenters

CloudRCA: A Root Cause Analysis Framework for Cloud Computing Platforms

CloudPin: A Root Cause Localization Framework of Shared Bandwidth Package Traffic Anomalies in Public Cloud Networks

CloudDet: Interactive Visual Analysis of Anomalous Performances in Cloud Computing Systems

Locating Anomaly Clues for Atypical Anomalous Services: An Industrial Exploration

Utility Cloud