Abstract: In production environments, a large proportion of microservice failures stem from complex and dynamic interactions and runtime environments, such as those involving multiple instances, environmental configurations, and asynchronous interactions of microservices. Because of their complexity and dynamism, these failures are often hard to reproduce and diagnose in testing environments. It is desirable, yet still challenging, to detect these failures and locate the underlying faults at runtime in the production environment so that developers can resolve them efficiently. To address this challenge, we propose MEPFL, an approach to latent error prediction and fault localization for microservice applications that learns from system trace logs. Based on a set of features defined on the system trace logs, MEPFL trains prediction models at both the trace level and the microservice level, using system trace logs collected from automatic executions of the target application and of its faulty versions produced by fault injection. The prediction models can then be used in the production environment to predict latent errors, faulty microservices, and fault types for trace instances captured at runtime. We implement MEPFL on top of container orchestrator and service mesh infrastructure, and conduct a series of experimental studies with two open-source microservice applications (one of them being the largest open-source microservice application to the best of our knowledge). The results indicate that MEPFL achieves high accuracy in intra-application prediction of latent errors, faulty microservices, and fault types, and outperforms a state-of-the-art approach to failure diagnosis for distributed systems. The results also show that MEPFL can effectively predict latent errors caused by real-world fault cases.

Guardian of the Resiliency: Detecting Erroneous Software Changes Before They Make Your Microservice System Less Fault-Resilient

Identifying Bad Software Changes Via Multimodal Anomaly Detection for Online Service Systems

Understanding and Improving Change Risk Detection in Practice

MicroRes: Versatile Resilience Profiling in Microservices via Degradation Dissemination Indexing

Run‐time Error Detection of Space‐robot Based on Adaptive Redundancy

Identifying Erroneous Software Changes through Self-Supervised Contrastive Learning on Time Series Data

MicroHECL: High-Efficient Root Cause Localization in Large-Scale Microservice Systems

Multilayered Fault Detection and Localization With Transformer for Microservice Systems

Latent Error Prediction and Fault Localization for Microservice Applications by Learning from System Trace Logs

Approach to Anomaly Detection in Microservice System with Multi-Source Data Streams

Evaluating the Risk of Changes in a Microservices Architecture

Fault Diagnosis for Test Alarms in Microservices Through Multi-source Data

Self-Adaptive Root Cause Diagnosis for Large-Scale Microservice Architecture

FluxRank: A Widely-Deployable Framework to Automatically Localizing Root Cause Machines for Software Service Failure Mitigation

Localizing Failure Root Causes in a Microservice through Causality Inference

An Empirical Study on Change-induced Incidents of Online Service Systems

SLIM: a Scalable Light-weight Root Cause Analysis for Imbalanced Data in Microservice

Root-Cause Metric Location for Microservice Systems Via Log Anomaly Detection

Research on Microservice Anomaly Detection Technology Based on Conditional Random Field

MicroFI: Non-Intrusive and Prioritized Request-Level Fault Injection for Microservice Applications

Rapid and Robust Impact Assessment of Software Changes in Large Internet-Based Services