Monika Steidl, Benedikt Dornauer, Michael Felderer, Rudolf Ramler, Mircea-Cristian Racasan, Marko Gattringer
Abstract: Deviations from expected behavior during runtime, known as anomalies, have become more common due to the systems' complexity, especially for microservices. Consequently, analyzing runtime monitoring data, such as logs, traces for microservices, and metrics, is challenging due to the large volume of data collected. Developing effective rules or AI algorithms requires a deep understanding of this data to reliably detect unforeseen anomalies. This paper seeks to comprehend anomalies and current anomaly detection approaches across diverse industrial sectors. Additionally, it aims to pinpoint the parameters necessary for identifying anomalies via runtime monitoring data.
Therefore, we conducted semi-structured interviews with fifteen industry participants who rely on anomaly detection during runtime. Additionally, to supplement information from the interviews, we performed a literature review focusing on anomaly detection approaches applied to industrial real-life datasets.
Our paper (1) demonstrates the diversity of interpretations and examples of software anomalies during runtime and (2) explores the reasons behind choosing rule-based approaches in the industry over self-developed AI approaches. AI-based approaches have become prominent in published industry-related papers in the last three years. Furthermore, we (3) identified key monitoring parameters collected during runtime (logs, traces, and metrics) that assist practitioners in detecting anomalies during runtime without introducing bias in their anomaly detection approach due to inconclusive parameters.
What problem does this paper attempt to address?
The paper addresses three main problems:
1. **Understanding the diversity and characteristics of runtime anomalies**:
- The paper explores how the industry interprets runtime anomalies, what characterizes them, and which examples practitioners encounter. As systems grow more complex, especially with microservice architectures, runtime anomalies are becoming more common. They may stem from software problems or hardware dependencies and can lead to system failures or performance degradation, so defining and identifying them is crucial for ensuring software quality.
- **Research Question RQ1**: How does the industry understand anomalies, and what are their characteristics and examples?
2. **Factors influencing the choice of anomaly detection methods**:
- The paper discusses what companies consider when choosing between rule-based and artificial intelligence (AI)-based anomaly detection methods. Rule-based methods rely on predefined thresholds and pattern matching, while AI methods use supervised and unsupervised learning models to identify anomalous patterns. Although AI methods are becoming more popular in academic research, the choices in the industry are more diverse, and each approach has its own advantages and disadvantages.
- **Research Question RQ2**: What factors influence the industry's choice of anomaly detection methods?
3. **Identifying key monitoring parameters for anomaly detection**:
- The paper identifies the key parameters for effectively detecting anomalies by analyzing runtime monitoring data (logs, traces, and metrics). Selecting and understanding these parameters is crucial for avoiding false positives and improving detection accuracy. For example, a stable relationship between certain parameters may indicate the absence of an anomaly, while changes in other parameters may signal one.
- **Research Question RQ3**: What runtime monitoring data does the industry use to identify anomalies?
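The idea that a stable relationship between parameters indicates normal behavior can be illustrated with a minimal sketch. It assumes two hypothetical metric series, request rate and CPU usage, whose ratio normally stays close to a constant. The function name, the choice of metrics, and the 50% tolerance are illustrative assumptions and are not taken from the paper.

```python
from statistics import median


def relationship_breaks(requests_per_s, cpu_percent, tolerance=0.5):
    """Return sample indices where CPU-per-request leaves its usual range.

    Assumption: under normal load, CPU usage scales with request rate,
    so the ratio cpu/requests stays near its median. `tolerance` is a
    hypothetical relative bound (0.5 means +/-50% of the median ratio).
    """
    if len(requests_per_s) != len(cpu_percent):
        raise ValueError("both series must have the same length")

    # Keep the original index so flagged samples can be traced back.
    ratios = [
        (i, cpu / req)
        for i, (req, cpu) in enumerate(zip(requests_per_s, cpu_percent))
        if req > 0  # skip idle samples to avoid division by zero
    ]
    if not ratios:
        return []

    baseline = median(ratio for _, ratio in ratios)
    return [i for i, ratio in ratios if abs(ratio - baseline) > tolerance * baseline]


if __name__ == "__main__":
    requests = [100, 200, 150, 100, 120]
    cpu = [10, 20, 15, 40, 12]
    # Sample 3 uses four times the usual CPU per request.
    print(relationship_breaks(requests, cpu))
```

A check like this looks at the two parameters together: high CPU usage alone is not flagged as long as the request rate explains it, which is one way to reduce false positives.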
### Specific problem analysis
#### RQ1: Understandings, characteristics, and examples of anomalies in the industry
- **Definition and characteristics**: Most respondents consider an anomaly to be behavior that deviates from expectations, which is consistent with the definition in the IEEE standard. Anomalies can manifest as abnormal behavior, unforeseen or non-repeatable situations, negative effects, and so on.
- **Examples**: The paper lists several practical cases, such as memory leaks, I/O operation anomalies, and increased response times, which show how diverse anomalies are.
#### RQ2: Factors influencing the choice of anomaly detection methods
- **Rule-based methods**: Almost all companies use custom rule-based methods because they have low computational costs, enjoy wide tool support, and detect anomalies quickly. However, setting these rules requires in-depth expertise and may be subjective.
- **AI-based methods**: Although AI methods have received much attention in academic research, their application in the industry is still limited. Only two companies have developed their own AI algorithms, and other companies rely on commercial AI tools (such as Dynatrace). The advantage of AI methods lies in their flexibility, but their effectiveness depends on high-quality datasets.
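The contrast between the two families can be sketched in a few lines. The first function applies hand-written rules (a fixed threshold and a log pattern), which is the rule-based style described above. The second derives its baseline from the data itself using a z-score. The z-score is a plain statistical stand-in chosen for brevity, not an AI model and not a method taken from the paper. The limit of 500 ms, the regular expression, and all function names are hypothetical.

```python
import re
from statistics import mean, stdev

# Hypothetical rules an expert might write by hand.
RESPONSE_TIME_LIMIT_MS = 500.0
ERROR_PATTERN = re.compile(r"\b(ERROR|FATAL)\b")


def rule_based_alerts(response_times_ms, log_lines):
    """Flag samples that violate a fixed threshold or match a log pattern."""
    alerts = []
    for i, value in enumerate(response_times_ms):
        if value > RESPONSE_TIME_LIMIT_MS:
            alerts.append(("threshold", i))
    for i, line in enumerate(log_lines):
        if ERROR_PATTERN.search(line):
            alerts.append(("pattern", i))
    return alerts


def zscore_outliers(values, z_limit=3.0):
    """Flag samples far from the mean of the observed data.

    No expert-set limit is needed, but the result is only as good as
    the data the baseline is computed from.
    """
    if len(values) < 2:
        return []
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # constant series: nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > z_limit]


if __name__ == "__main__":
    times = [120.0, 130.0, 900.0, 125.0]
    logs = ["INFO start", "ERROR db timeout", "WARN slow"]
    print(rule_based_alerts(times, logs))
    print(zscore_outliers([100.0] * 19 + [400.0]))
```

The rule-based function is cheap and easy to explain, but someone has to pick the limit. The data-driven function adapts to each series, but a baseline computed from poor or contaminated data yields poor detections, which mirrors the dependence on data quality noted above.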
#### RQ3: Identifying key monitoring parameters for anomaly detection
- **Monitoring data types**: The paper classifies monitoring parameters into three categories: logs, traces, and metrics. Logs contain structured messages, traces record the end-to-end execution paths of requests, and metrics are time-series data on system performance.
- **Key parameters**: Through in-depth analysis of these data types, the paper identifies key parameters for effectively detecting anomalies, such as response time, error messages, and CPU and memory usage. Understanding the relationships between these parameters helps to reduce false positives and improve detection accuracy.
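The three data types and the parameters named above can be pictured with a small sketch. The record layouts, field names, and metric names (`cpu_percent`, `memory_percent`) are simplifying assumptions for illustration; real monitoring tools and the paper's parameter list carry considerably more detail.

```python
from dataclasses import dataclass


@dataclass
class LogRecord:
    """One log entry: a timestamped message with a severity level."""
    timestamp: float
    level: str
    message: str


@dataclass
class Span:
    """One step of a request; spans sharing a trace_id form a trace."""
    trace_id: str
    service: str
    start: float
    end: float


@dataclass
class MetricSample:
    """One point of a named time series."""
    timestamp: float
    name: str
    value: float


def key_parameters(logs, spans, metrics):
    """Collect the example parameters: errors, response time, CPU, memory."""
    # End-to-end response time per trace: earliest start to latest end.
    bounds = {}
    for span in spans:
        lo, hi = bounds.get(span.trace_id, (span.start, span.end))
        bounds[span.trace_id] = (min(lo, span.start), max(hi, span.end))

    return {
        "error_messages": [r.message for r in logs if r.level == "ERROR"],
        "response_time_s": {tid: hi - lo for tid, (lo, hi) in bounds.items()},
        "cpu_percent": [m.value for m in metrics if m.name == "cpu_percent"],
        "memory_percent": [m.value for m in metrics if m.name == "memory_percent"],
    }


if __name__ == "__main__":
    logs = [LogRecord(0.2, "INFO", "request received"),
            LogRecord(0.4, "ERROR", "db timeout")]
    spans = [Span("t1", "gateway", 0.0, 0.5),
             Span("t1", "db", 0.1, 0.4),
             Span("t2", "gateway", 1.0, 1.25)]
    metrics = [MetricSample(0.0, "cpu_percent", 35.0),
               MetricSample(0.0, "memory_percent", 60.0)]
    print(key_parameters(logs, spans, metrics))
```

Gathering the parameters into one structure keyed by type makes it easier to examine them jointly, for example to check whether an error message coincides with a slow trace.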
By exploring these questions, the paper gives the industry a comprehensive view of anomaly detection and helps companies and developers better understand and handle runtime anomalies.