Xuchao Zhang, Supriyo Ghosh, Chetan Bansal, Rujia Wang, Minghua Ma, Yu Kang, Saravan Rajmohan
Abstract: Root Cause Analysis (RCA) plays a pivotal role in the incident diagnosis process for cloud services, requiring on-call engineers to identify the primary issues and implement corrective actions to prevent future recurrences. Improving the incident RCA process is vital for minimizing service downtime, customer impact and manual toil. Recent advances in artificial intelligence have introduced state-of-the-art Large Language Models (LLMs) like GPT-4, which have proven effective in tackling various AIOps problems, ranging from code authoring to incident management. Nonetheless, the GPT-4 model's immense size presents challenges when trying to fine-tune it on user data because of the significant GPU resource demand and the necessity for continuous model fine-tuning with the emergence of new data. To address the high cost of fine-tuning LLMs, we propose an in-context learning approach for automated root causing, which eliminates the need for fine-tuning. We conduct an extensive study over 100,000 production incidents, comparing several large language models using multiple metrics. The results reveal that our in-context learning approach outperforms the previous fine-tuned large language models such as GPT-3 by an average of 24.8% across all metrics, with an impressive 49.7% improvement over the zero-shot model. Moreover, human evaluation involving actual incident owners demonstrates its superiority over the fine-tuned model, achieving a 43.5% improvement in correctness and an 8.7% enhancement in readability. The impressive results demonstrate the viability of utilizing a vanilla GPT model for the RCA task, thereby avoiding the high computational and maintenance costs associated with a fine-tuned model.
What problem does this paper attempt to address?
The paper addresses Root Cause Analysis (RCA) in the incident diagnosis process of cloud services, focusing on how to automate it effectively in order to reduce service downtime, minimize customer impact, and alleviate manual workload. Because fine-tuning large language models (LLMs) is costly and must be repeated as new data arrives, the authors propose an in-context learning approach that performs root cause analysis without any model fine-tuning.
Specifically, the key contributions of the paper include:
1. **Innovative In-Context Learning Method**: A new in-context learning method is proposed that supplies relevant historical incidents as examples directly in the prompt of a large language model (e.g., GPT-4), so the model acquires domain-specific knowledge without the time-consuming and expensive fine-tuning process.
2. **Large-Scale Empirical Study**: A large-scale experimental evaluation was conducted on a real dataset from a major cloud service provider, which includes over 100,000 incident records from more than 1,000 different services. The results show that the proposed in-context learning method improves performance by an average of 24.8% across all evaluation metrics compared to the fine-tuned GPT-3 model, and by 49.7% over the zero-shot model.
3. **Human Validation Study**: The effectiveness of the method was further demonstrated through a human validation study involving actual incident owners, in which it improved on the fine-tuned model by 43.5% in correctness and 8.7% in readability.
4. **Methodological Exploration**: Through a series of research questions (RQs), various aspects of the in-context learning method were explored in depth, including whether comparable performance can be achieved using only standard, non-fine-tuned LLMs, the effectiveness of traditional retrieval-augmented methods, and the impact of the number and relevance of context examples on performance.
5. **Technical Architecture**: The technical architecture of the proposed in-context learning-based root cause analysis framework is detailed, including steps such as data preparation, example extraction, retrieval index construction, and root cause generation.
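The retrieval and generation steps listed above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Incident` fields, the helper names, the sample incidents, and the prompt wording are all hypothetical, and a simple bag-of-words cosine similarity stands in for whatever embedding model and index a production system would use. The final call to an LLM is omitted; the sketch stops at the assembled few-shot prompt.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Incident:
    summary: str     # hypothetical field: short description of the incident
    root_cause: str  # hypothetical field: root cause recorded by the incident owner


def tokenize(text: str) -> Counter:
    """Lowercase bag-of-words vector (a stand-in for a learned embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(history):
    """Retrieval index: one (vector, incident) pair per historical incident."""
    return [(tokenize(inc.summary), inc) for inc in history]


def retrieve(index, query: str, k: int = 2):
    """Return the k historical incidents most similar to the new incident."""
    q = tokenize(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[0]), reverse=True)
    return [inc for _, inc in ranked[:k]]


def build_prompt(examples, query: str) -> str:
    """Assemble a few-shot prompt: retrieved examples first, new incident last."""
    parts = ["Given past incidents and their root causes, "
             "infer the root cause of the new incident.", ""]
    for ex in examples:
        parts += [f"Incident: {ex.summary}", f"Root cause: {ex.root_cause}", ""]
    parts += [f"Incident: {query}", "Root cause:"]
    return "\n".join(parts)


# Made-up sample data, for illustration only.
history = [
    Incident("Storage nodes returning 503 after certificate rotation",
             "Expired TLS certificate on storage frontend"),
    Incident("Login latency spike in auth service",
             "Connection pool exhaustion in auth database"),
    Incident("API gateway returning 503 after config deployment",
             "Invalid routing rule pushed in config"),
]
index = build_index(history)
new_incident = "Gateway returning 503 errors after deployment"
examples = retrieve(index, new_incident, k=2)
prompt = build_prompt(examples, new_incident)
print(prompt)
```

The parameter `k` corresponds to the number of in-context examples, and the similarity ranking to their relevance, which are the two factors the research questions examine. The resulting `prompt` string would be sent unchanged to a non-fine-tuned model.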
In summary, the paper proposes an effective in-context learning-based method for automated root cause analysis tasks and validates its effectiveness through empirical studies, providing a new solution for cloud service incident management.