Abstract: Although tremendous efforts have been devoted to the quality assurance of online service systems, in reality these systems still experience many incidents (i.e., unplanned interruptions and outages), which can decrease user satisfaction or cause economic loss. To better understand the characteristics of incidents and improve the incident management process, we perform the first large-scale empirical analysis of incidents collected from 18 real-world online service systems at Microsoft. Surprisingly, we find that although a large number of incidents can occur over a short period of time, many of them do not actually matter, i.e., engineers will not fix them with a high priority after manually identifying their root cause. We call these incidents incidental incidents. Our qualitative and quantitative analyses show that incidental incidents are significant in terms of both number and cost. Therefore, it is important to prioritize incidents by identifying incidental incidents in advance, so that incident management efforts can be optimized. To this end, we propose an approach called DeepIP (Deep learning based Incident Prioritization), which prioritizes incidents based on a large amount of historical incident data. More specifically, we design an attention-based Convolutional Neural Network (CNN) to learn a prediction model that identifies incidental incidents. We then prioritize all incidents by ranking their predicted probabilities of being incidental. We evaluate the performance of DeepIP using real-world incident data. The experimental results show that DeepIP effectively prioritizes incidents by identifying incidental incidents and significantly outperforms all the compared approaches. For example, DeepIP achieves an AUC of 0.808 on average, while the best compared approach achieves only 0.624.
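
The ranking and evaluation steps described in the abstract can be illustrated with a minimal sketch. It is not the DeepIP implementation: the attention-based CNN is omitted, the predicted probabilities and labels are made-up placeholder values, the function names `prioritize` and `auc` are hypothetical, and the rule that incidents least likely to be incidental are handled first is an assumption about how the ranking is used.

```python
from typing import List, Tuple


def prioritize(incidents: List[Tuple[str, float]]) -> List[str]:
    """Order incident IDs so the least likely to be incidental come first.

    Each tuple is (incident_id, predicted probability of being incidental).
    Treating a low probability as high priority is an assumption of this sketch.
    """
    return [incident_id for incident_id, _ in sorted(incidents, key=lambda x: x[1])]


def auc(labels: List[int], scores: List[float]) -> float:
    """Pairwise AUC: the chance that a random incidental incident (label 1)
    is scored higher than a random non-incidental one (label 0).
    Ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical model outputs; a real system would obtain these from a trained classifier.
predicted = [("INC-1", 0.91), ("INC-2", 0.12), ("INC-3", 0.55), ("INC-4", 0.30)]
labels = [1, 0, 0, 1]  # placeholder ground truth: 1 = incidental, 0 = not incidental

order = prioritize(predicted)
score = auc(labels, [p for _, p in predicted])
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of incidental from non-incidental incidents, which is the scale on which the reported values of 0.808 and 0.624 should be read.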

Identifying Root-Cause Metrics for Incident Diagnosis in Online Service Systems

Multi-stage Location for Root-Cause Metrics in Online Service Systems

Root-Cause Metric Location for Microservice Systems Via Log Anomaly Detection

Practical Anomaly Detection over Multivariate Monitoring Metrics for Online Services

Adaptive Performance Anomaly Detection for Online Service Systems via Pattern Sketching

CMMD: Cross-Metric Multi-Dimensional Root Cause Analysis

Self-Adaptive Root Cause Diagnosis for Large-Scale Microservice Architecture

Identifying Root-Cause Changes for User-Reported Incidents in Online Service Systems

AutoMAP: Diagnose Your Microservice-based Web Applications Automatically

An Empirical Investigation of Incident Triage for Online Service Systems

MS-Rank: Multi-Metric and Self-Adaptive Root Cause Diagnosis for Microservice Applications

HeMiRCA: Fine-Grained Root Cause Analysis for Microservices with Heterogeneous Data Sources

An Empirical Investigation of Practical Log Anomaly Detection for Online Service Systems

Performance Issue Diagnosis for Online Service Systems

OCRCL: Online Contrastive Learning for Root Cause Localization of Business Incidents

Causal Inference-Based Root Cause Analysis for Online Service Systems with Intervention Recognition

Automated Root Cause Analysis with Observability Data - A Comprehensive Review

MicroHECL: High-Efficient Root Cause Localization in Large-Scale Microservice Systems

How Incidental Are the Incidents?

How to Mitigate the Incident? An Effective Troubleshooting Guide Recommendation Technique for Online Service Systems

An Empirical Study on Change-induced Incidents of Online Service Systems