AI Assistants for Incident Lifecycle in a Microservice Environment: A Systematic Literature Review

Dahlia Ziqi Zhou, Marios Fokaefs
2024-10-06
Abstract: Incidents in microservice environments can be costly and challenging to recover from due to their complexity and distributed nature. Recent advancements in artificial intelligence (AI) offer promising solutions for improving incident management. This paper systematically reviews primary studies on AI assistants designed to support different phases of the incident lifecycle. It highlights successful applications of AI, identifies gaps in current research, and suggests future opportunities for enhancing incident management through AI. By examining these studies, the paper aims to provide insights into the effectiveness of AI tools and their potential to address ongoing challenges in incident recovery.
Software Engineering
What problem does this paper attempt to address?
This paper addresses the complexity and difficulty of incident management in microservice environments. Specifically:

1. **Complexity of incident management**: Because a microservice architecture is dynamic and distributed, debugging and repairing the system is very difficult. Unlike a traditional monolithic system, a microservice system is composed of many independently running components that must communicate seamlessly to work as a whole. Identifying the root cause of an incident usually requires tracing interactions across multiple microservices, which is time-consuming and error-prone.
2. **Large and complex data volume**: A microservice environment generates large volumes of logs, traces, and metrics, and analyzing this data manually is too slow for quick diagnosis and resolution.
3. **Four stages of the incident lifecycle**: Following the definition of the National Institute of Standards and Technology (NIST), the incident lifecycle includes Preparation, Detection, Containment, and Post-incident Analysis. In a microservice environment, each stage is harder to manage. For example, detection may require analyzing multiple data sources to identify anomalies, and containment may involve isolating faults in distributed services.
4. **Deficiencies of existing methods**: As systems grow in scale and complexity, traditional methods often cannot meet the requirements of incident management, resulting in longer downtime and greater economic losses. For example, the four-hour AWS outage in 2017 caused a loss of $150 million to S&P 500 companies, and the 14-hour Facebook outage in 2020 resulted in a revenue loss of approximately $90 million.
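To make the detection point above concrete (analyzing several data sources to identify anomalies), here is a minimal Python sketch. It is not a method taken from the paper or from the studies it reviews. The function names `zscore_anomalies` and `detect_incident_candidates`, the window size, the threshold, and the sample values are all assumptions chosen for illustration. The sketch flags a time index as an incident candidate only when at least two data sources show a large deviation from their own recent history.

```python
from collections import Counter
from statistics import mean, stdev


def zscore_anomalies(samples, window=5, threshold=3.0):
    """Return indices whose value lies more than `threshold` standard
    deviations away from the mean of the preceding `window` samples."""
    flagged = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        sigma = stdev(history)
        if sigma == 0:
            # A perfectly flat history gives no scale to compare against.
            continue
        if abs(samples[i] - mean(history)) / sigma > threshold:
            flagged.append(i)
    return flagged


def detect_incident_candidates(sources, window=5, threshold=3.0, min_sources=2):
    """Return time indices flagged by at least `min_sources` data sources.

    `sources` maps a hypothetical source name to an equally sampled series.
    """
    votes = Counter()
    for series in sources.values():
        votes.update(zscore_anomalies(series, window, threshold))
    return sorted(i for i, n in votes.items() if n >= min_sources)


# Synthetic values for illustration only (not measurements from any study).
sources = {
    "latency_ms": [100, 102, 98, 101, 99, 100, 450, 101],
    "error_count": [1, 0, 2, 1, 1, 2, 40, 1],
}
candidates = detect_incident_candidates(sources)
print(candidates)
```

Requiring agreement between sources is one simple way to reduce false alarms from a single noisy signal. A real system would likely also use traces and logs and a more robust baseline than a short trailing window.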
To address these problems, the paper conducts a systematic literature review (SLR) of how AI assistants support the different stages of the incident lifecycle in microservice environments. It aims to provide insights into the effectiveness and potential of AI tools for the ongoing challenges of incident recovery, and to reveal the successes and deficiencies of existing research, pointing the way for future work.

### Main research questions of the paper

1. **Which stages of the incident lifecycle are assisted by AI assistants?**
2. **What are the goals of these AI assistants?**
3. **What methods do these AI assistants use?**
4. **What types of data do these tools use to assist in incident handling?**

By answering these questions, the paper hopes to provide key insights for the future application of AI in incident management.