Abstract: Deviations from expected behavior during runtime, known as anomalies, have become more common due to the systems' complexity, especially for microservices. Consequently, analyzing runtime monitoring data, such as logs, traces for microservices, and metrics, is challenging due to the large volume of data collected. Developing effective rules or AI algorithms requires a deep understanding of this data to reliably detect unforeseen anomalies. This paper seeks to comprehend anomalies and current anomaly detection approaches across diverse industrial sectors. Additionally, it aims to pinpoint the parameters necessary for identifying anomalies via runtime monitoring data. Therefore, we conducted semi-structured interviews with fifteen industry participants who rely on anomaly detection during runtime. Additionally, to supplement information from the interviews, we performed a literature review focusing on anomaly detection approaches applied to industrial real-life datasets. Our paper (1) demonstrates the diversity of interpretations and examples of software anomalies during runtime and (2) explores the reasons behind choosing rule-based approaches in the industry over self-developed AI approaches. AI-based approaches have become prominent in published industry-related papers in the last three years. Furthermore, we (3) identified key monitoring parameters collected during runtime (logs, traces, and metrics) that assist practitioners in detecting anomalies during runtime without introducing bias in their anomaly detection approach due to inconclusive parameters.

What problem does this paper attempt to address?

This paper addresses three main problems:

1. **Understanding the diversity and characteristics of runtime anomalies**
   - The paper explores how the industry interprets runtime anomalies, what characterizes them, and what they look like in practice. As systems grow more complex, especially with the introduction of microservice architectures, runtime anomalies are becoming more common. They may stem from software problems or hardware dependencies and can lead to system failures or performance degradation, so understanding how to define and identify them is crucial for ensuring software quality.
   - **Research Question RQ1**: What are the understandings, characteristics, and examples of anomalies in the industry?
2. **Influencing factors in choosing anomaly detection methods**
   - The paper discusses what enterprises consider when choosing between rule-based and artificial intelligence (AI)-based anomaly detection methods. Rule-based methods rely on predefined thresholds and pattern matching, while AI methods use supervised and unsupervised learning models to identify abnormal patterns. Although AI methods are becoming more popular in academic research, industry choices are more varied, and each approach has its own advantages and disadvantages.
   - **Research Question RQ2**: What factors influence the industry's choice of anomaly detection methods?
3. **Identifying key monitoring parameters for anomaly detection**
   - The paper determines the key parameters for effectively detecting anomalies by analyzing runtime monitoring data (such as logs, traces, and metrics). Selecting and understanding these parameters is crucial for avoiding false positives and improving detection accuracy. For example, a stable relationship between certain parameters may indicate the absence of an anomaly, while changes in other parameters may be signs of one.
   - **Research Question RQ3**: What runtime monitoring data does the industry use to identify anomalies?

### Specific problem analysis

#### RQ1: Understandings, characteristics, and examples of anomalies in the industry

- **Definition and characteristics**: Most respondents regard an anomaly as behavior that deviates from expectations, which is consistent with the definition in the IEEE standard. Anomalies can manifest as abnormal behaviors, unforeseen or non-repeatable situations, negative effects, etc.
- **Examples**: The paper lists several practical cases, such as memory leaks, I/O operation anomalies, and increased response times, showing the diversity of anomalies.

#### RQ2: Influencing factors in choosing anomaly detection methods

- **Rule-based methods**: Almost all companies use custom-made rule-based methods because they have low computational cost, enjoy wide tool support, and detect anomalies quickly. However, setting these rules requires in-depth expertise and may be subjective.
- **AI-based methods**: Although AI methods have received much attention in academic research, their application in the industry is still limited. Only two companies have developed their own AI algorithms, and other companies rely on commercial AI tools (such as Dynatrace). The advantage of AI methods lies in their flexibility, but their effectiveness depends on high-quality data sets.

#### RQ3: Identifying key monitoring parameters for anomaly detection

- **Monitoring data types**: The paper classifies monitoring parameters into three categories: logs, traces, and metrics. Logs contain structured messages, traces record the end-to-end execution paths of requests, and metrics are time-series data of system performance.
- **Key parameters**: Through in-depth analysis of these data types, the paper determines key parameters for effectively detecting anomalies, such as response time, error messages, and CPU and memory usage rates. Understanding the relationships between these parameters helps to reduce false positives and improve detection accuracy.

By exploring these questions, the paper gives the industry a comprehensive view of anomaly detection, helping enterprises and developers better understand and deal with runtime anomalies.
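
To make the rule-based approach more concrete, the following is a minimal Python sketch of threshold rules evaluated against a single metrics sample. It is not taken from the paper: the field names, rule names, and threshold values are all hypothetical and chosen only for illustration. The last rule shows how a relationship between two parameters (CPU usage and request rate) can flag an anomaly that no single threshold would catch.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class MetricSample:
    """One point of runtime metrics (all field names are hypothetical)."""
    response_time_ms: float
    cpu_percent: float
    memory_percent: float
    requests_per_s: float


@dataclass(frozen=True)
class Rule:
    """A named predicate that returns True when a sample looks anomalous."""
    name: str
    is_violated: Callable[[MetricSample], bool]


# Illustrative thresholds only; real values require domain expertise.
RULES: List[Rule] = [
    Rule("slow_response", lambda s: s.response_time_ms > 500.0),
    Rule("cpu_saturated", lambda s: s.cpu_percent > 90.0),
    Rule("memory_high", lambda s: s.memory_percent > 85.0),
    # Relationship rule: noticeable CPU usage with almost no traffic breaks
    # the usual load/CPU relationship, although CPU alone is below its limit.
    Rule(
        "cpu_without_load",
        lambda s: s.cpu_percent > 60.0 and s.requests_per_s < 1.0,
    ),
]


def violated_rules(sample: MetricSample, rules: List[Rule] = RULES) -> List[str]:
    """Return the names of all rules that the sample violates."""
    return [rule.name for rule in rules if rule.is_violated(sample)]


if __name__ == "__main__":
    normal = MetricSample(120.0, 35.0, 40.0, 50.0)
    suspicious = MetricSample(130.0, 70.0, 45.0, 0.2)
    print(violated_rules(normal))      # []
    print(violated_rules(suspicious))  # ['cpu_without_load']
```

A sketch like this is cheap to evaluate and easy to inspect, which matches the advantages reported for rule-based methods, but every threshold encodes an expert's judgment, which is where the subjectivity mentioned above comes in.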

How Industry Tackles Anomalies during Runtime: Approaches and Key Monitoring Parameters
